HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
MASTER THESIS
Unified Model of Detection and Recognition
for Han Nom Characters
NGUYỄN VĂN LỢI
loi.nv142732@sis.hust.edu.vn
Information Systems
Supervisor: Dr. Nguyễn Thị Oanh
School: Information and Communication Technology (SoICT)
HÀ NỘI, 03/2021
SOCIALIST REPUBLIC OF VIET NAM
Independence – Freedom – Happiness
CONFIRMATION OF MASTER'S THESIS REVISION
Full name of the thesis author: Nguyễn Văn Lợi
Thesis title: Unified Model of Detection and Recognition for Han Nom Characters
Major: Information Systems
Student ID: CBC19016
The author, the scientific supervisor, and the Thesis Examination Board confirm that the author has revised and supplemented the thesis according to the minutes of the Board meeting on 24/04/2021, with the following content:
- Corrected and supplemented the content on the overall system diagram and the thesis title
May 03, 2021
Supervisor: Dr. Nguyễn Thị Oanh
Thesis author: Nguyễn Văn Lợi
CHAIR OF THE EXAMINATION BOARD
Assoc. Prof. Dr. Phạm Văn Hải
Acknowledgments
With deep gratitude, I would like to send my most sincere thanks to Dr. Nguyễn Thị Oanh, who has helped me greatly over the past time, from dedicated guidance on the theory, skills, and attitude I needed, so that I could complete this master's thesis as well as possible. Throughout the work on this thesis, she was always enthusiastic, attentive, and wholeheartedly supportive, helping me through the hardest stretches and the most difficult times.
At the same time, I would like to send my deepest thanks to my family and friends, who have always been a solid support for me throughout this time and a great source of motivation behind the success I have today.
I sincerely thank you all!
Hà Nội, May 03, 2021
Nguyễn Văn Lợi
STUDENT
Thesis Abstract
In this master's thesis, I first present the problem I am concerned with and aim to solve: detecting and recognizing sequences of Han Nom characters in arbitrary digital images. Specifically, I present the historical context in which the problem arose, as well as the reasons and motivation behind my research. I then state the problem and define it concretely. Throughout this thesis, I combine common scientific research methods: analysis and synthesis, comparison, use of quantitative data, and enumeration. These methods are applied alternately, regularly, fully, and rigorously throughout the thesis. For example, all four methods are used in the phase of surveying related studies and related topics, and in comparing them to arrive at a suitable proposed solution. The analysis, synthesis, and comparison methods are also present in the detailed description of the proposed solution, as well as in the experimental implementation, evaluation, comparison of results, and the directions for future development. In more detail, the model I propose is a neural network model inspired by effective and popular text detection and recognition models such as CRNN, CRAFTS, FOTS, EAST, TextBoxes++, etc. However, my proposed model introduces significant improvements aimed at reducing hardware resource consumption, improving accuracy, matching the target objects, and laying the groundwork for building more efficient data pipelines and architecture-linking modules in the future. The result of this master's thesis is a complete system that answers the problems posed at the beginning. It is also entirely practical and can easily be applied in everyday life, helping to reduce people's workload. For that reason, I have examined the relevant aspects in detail and clearly, so that the system can be developed and extended in the future:
- Building a translation subsystem
- Extending to Vietnamese and other languages
- Building a compact model for mobile devices
- Building a service platform and applications for the system
Contents
Acronyms
1 Introduction
1.1 Introduction
1.2 Tasks
1.3 Scope of the study
1.4 Content overview
2 Theoretical Basis
2.1 Artificial Neural Network
2.1.1 Artificial Neuron
2.1.2 Feedforward Neural Network
2.1.3 Convolutional Neural Network
2.1.4 Recurrent Neural Network
2.2 Region of Interest pooling
2.2.1 Conventional RoI pooling
2.2.2 RoI Align
2.2.3 Other popular RoI pooling techniques
2.3 Detection and Segmentation
2.3.1 Detection
2.3.2 Segmentation
2.4 Sampling and Interpolation
2.5 Training and Inference
2.5.1 Training
2.5.2 Inference
3 Related Work
3.1 Detection and text-spotting models
3.1.1 CharNet (Convolution Character Network)
3.1.2 PMTD (Pyramid Mask Text Detector)
3.1.3 OBD (Orderless Box Discretization Network)
3.1.4 FOTS (Fast Oriented Text Spotting)
3.1.5 ContourNet
3.1.6 CRAFT and CRAFTS
3.1.7 Comparison
3.2 Recognition models
3.2.1 CRNN
3.2.2 RARE
4 Proposed Solutions and Improvements
4.1 Proposed solutions
4.1.1 Remarks
4.1.2 Adaptive solution
4.2 Unified Model for Arbitrary-shape Text Spotting
4.2.1 Overview
4.2.2 Detector
4.2.3 Connector
4.2.4 Recognizer
5 Implementation and Evaluation
5.1 Experimental models
5.2 Datasets
5.2.1 ReCTS2019
5.2.2 SynthText
5.2.3 Chinese Synthetic String dataset
5.2.4 Chinese Street View Text dataset
5.3 Implementation
5.3.1 Development environment
5.3.2 Training strategy
5.4 Experimental results
5.4.1 Results of the detection models
5.4.2 Results of the UMATS text-spotting model
6 Conclusions and Future work
6.1 Conclusions
6.2 Future work
List of Figures
List of Tables
Bibliography
Acronyms
AI Artificial Intelligence.
AMP Automatic Mixed Precision.
ANN Artificial Neural Network.
API Application Programming Interface.
BN Batch Normalization.
CCL Connected Components Labeling.
CNN Convolutional Neural Network.
CTC Connectionist Temporal Classification.
DL Deep Learning.
DNN Deep Neural Network.
DPM Deformable Part-based Model.
FCN Fully Connected Neural Network.
FNN Feedforward Neural Network.
FPN Feature Pyramid Network.
GRU Gated Recurrent Unit.
GT Ground Truth.
IoU Intersection Over Union.
KE Key Edge.
LSTM Long Short-Term Memory.
MSE Mean Square Error.
MTL Matching-Type Learning.
NED Normalized Edit Distance.
NMS Non-Maximum Suppression.
OBD Orderless Box Discretization.
OCR Optical Character Recognition.
OHEM Online Hard Example Mining.
ReLU Rectified Linear Unit.
RNN Recurrent Neural Network.
RoI Region of Interest.
RPN Region Proposal Network.
RPP Rescoring and Post-Processing.
STN Spatial Transformer Network.
SVM Support Vector Machine.
TPS Thin-Plate Spline.
Chapter 1
Introduction
1.1 Introduction
Nowadays, more than 6 billion people on Earth use smartphones and other small handheld information devices in their everyday lives. Moreover, most of these devices have built-in cameras. In addition, the deployment of surveillance camera systems and the advent of social networks in the era of the fourth industrial revolution have led to an explosion of image resources. At the same time, machine learning and computer vision algorithms have made continuous progress year after year. As a result, high-performance, low-cost object detection and recognition systems are continually being introduced and widely applied in many areas of life. One of the subjects attracting the attention of many researchers is characters. Characters are special objects: they are a means of exchanging information, and detecting and reading them effectively simplifies many real-life applications.
Some applications include locating and measuring the geographical position of an object by reading the characters associated with it, thereby helping to locate the object, extract the necessary information about its position, or detect the location of dangerous objects. Another application is image classification: classifying objects based on the character sequences assigned to them. Object tracking based on detecting and identifying the numbers assigned to objects, such as reading number plates or detecting labeled objects, is also an interesting application. Another potential application is reading and translating character sequences on documents, stelae, signs, or historical sites. Developing and improving the efficiency of this kind of model is the key to building automatic reading systems in the future and is a springboard for the other applications to thrive.
In fact, most research on automatic character spotting and translation focuses on English, even though Chinese is one of the most popular and frequently used languages in the world. Meanwhile, Nom, the official script of the Vietnamese people before the modern Vietnamese alphabet was created and popularized, is fading over time. Moreover, many Vietnamese documents and historical structures use and contain Han Nom characters. Developing Han Nom spotting and translation applications therefore brings a lot of great value.
In recent years, many models, tools, and systems have appeared that allow detecting, recognizing, and even translating characters in images from one language to another. However, many problems still need to be solved, and there are many development directions that could further enhance the efficiency and accuracy of such systems. For example, some systems require a periodic fee to be used, some have limited usable functionality, and others only work well under specific conditions.
For these reasons, in this thesis, we will focus on building a model to detect and
recognize characters, hieroglyphs in general, or Han Nom characters in particular.
Figure 1.1: Problem definition: localizing regions of lines of characters and convert-
ing them into encoded strings of characters [1]
1.2 Tasks
In order to build the above model, we will perform the following tasks:
- Firstly, we research and evaluate different scene text detection and recognition methods suitable for hieroglyphs.
- Secondly, we consider the pros and cons of these methods: we find strengths to promote or reuse, and weaknesses to replace or eliminate.
- Then, we consider the problem being solved and combine it with existing knowledge to propose a complete and reasonably effective solution.
- After that, we describe the proposed solution in detail.
- Then, we choose the appropriate development environment and the resources needed to implement and improve the solution.
- Lastly, we evaluate the solutions based on popular benchmarks, comment on the achieved results, and propose future improvements and development directions.
1.3 Scope of the study
Here are some of the scopes that we set for this study:
- The solution is designed to detect and recognize sequences of characters rather than individual letters.
- The solution focuses on the problem of detecting and recognizing Han Nom characters. However, it may also incorporate the ability to detect and recognize Latin letters, numbers, and special characters to facilitate the evaluation process.
- The solution is designed to target objects under the most general conditions: in natural environments, of arbitrary shape, under different lighting conditions, etc.
1.4 Content overview
The remaining chapters of the thesis are organized as follows:
- Chapter 2 briefly introduces some relevant fundamental theoretical bases: artificial neurons, convolutional neural networks, RoI pooling, etc.
- Chapter 3 presents a number of related studies, comparing and evaluating them.
- Chapter 4 describes and analyzes the proposed solution in detail.
- Chapter 5 presents the specific configurations, implementation process, evaluation methods, associated auxiliary modifications, and the comparison and evaluation of different models.
- Chapter 6 presents the contributions made, the outstanding issues, and the future development orientation of the project.
Chapter 2
Theoretical Basis
2.1 Artificial Neural Network
Artificial Neural Networks (ANNs), usually simply called neural networks, are
computing systems vaguely inspired by the biological neural networks that constitute
animal brains. An ANN is based on a collection of connected units or nodes called
artificial neurons.
2.1.1 Artificial Neuron
An artificial neuron is a connection point in an ANN. ANNs, like the human
body’s biological neural network, have a layered architecture and each network node
(connection point) has the capability to process input and forward output to other
nodes in the network. In both artificial and biological architectures, the nodes are
called neurons and the connections are characterized by synaptic weights, which
represent the significance of the connection. As new data is received and processed,
the synaptic weights change and this is how learning occurs.
Artificial neurons are modeled after the hierarchical arrangement of neurons in
biological sensory systems. In the visual system, for example, light input passes
through neurons in successive layers of the retina before being passed to neurons in
the thalamus of the brain and then on to neurons in the brain’s visual cortex. As the
neurons pass signals through an increasing number of layers, the brain progressively
extracts more information until it is confident it can identify what the person is
seeing. In Artificial Intelligence (AI), this fine-tuning process is known as Deep
Learning (DL).
In both artificial and biological networks, when neurons process the input they
receive, they decide whether the output should be passed on to the next layer as
input. The decision of whether or not to send information on is governed by an activation function built into the system (together with a bias term). For example, an artificial neuron may only pass an output signal on to the next layer if its inputs (which, in the biological analogy, are voltages) sum to a value above some particular threshold value.
Because activation functions can either be linear or non-linear, neurons will often
have a wide range of convergence and divergence. Divergence is the ability for one
neuron to communicate with many other neurons in the network and convergence is
the ability for one neuron to receive input from many other neurons in the network.
Figure 2.1 depicts the simple structure of an artificial neuron.

Figure 2.1: Structure of an artificial neuron (source: https://isaacchanghau.github.io/)

In which:
- $x_1, x_2, \ldots, x_n$ (axons in biology) are the signal values transmitted from other places to the neuron.
- These input signals reach the neuron through connections. Each of these connections has an assigned value called the connection weight (synapse in biology). In the figure, $w_{1j}, w_{2j}, \ldots, w_{nj}$ are those weights. Their purpose is to amplify or attenuate the value of the input signal.
- $w_{1j}x_1, w_{2j}x_2, \ldots, w_{nj}x_n$ are the weighted inputs (dendrites in biology).
- The transfer function is a function used to calculate the direct input signal value of a neuron. This function usually takes the form of the sum of the weighted inputs, as described above.
- The activation function is a function used to activate the neuron. It is usually a nonlinear function, which allows a set of neurons to simulate complex models. In general, this function depends on the internal state of the neuron, on the input signal of the neuron, and on the threshold in the neuron. The result of the activation function is the output value of the neuron (output axon). The selection of the activation function is also an important issue in the design of artificial neurons. Figure 2.2 shows some common activation functions. Sigmoid and Tanh are two saturating nonlinear functions. Their disadvantage is that when the input has a large absolute value, the gradient of the function is very close to 0, which means that the model parameters feeding the neuron in question will almost never be updated (this saturating behavior is also part of what motivates the Batch Normalization (BN) layer). The activation function Rectified Linear Unit (ReLU), on the other hand, does not bound its output, and at the same time is fast to compute, for both its output value and its derivative, so it is frequently used.
- Threshold is a value that specifies the output signal limit for each neuron.
Figure 2.2: Popular activation functions (source: https://adilmoujahid.com/)
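To make the description above concrete, the following minimal sketch (illustrative, not from the thesis) implements a single artificial neuron in Python with NumPy, with the weighted-sum transfer function and a choice of ReLU or sigmoid activation:

```python
import numpy as np

def sigmoid(z):
    # Saturating nonlinearity: its gradient vanishes for |z| >> 0.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Unbounded above and cheap to compute, as noted above.
    return np.maximum(0.0, z)

def neuron_output(x, w, b, activation=relu):
    """One artificial neuron: a transfer function (weighted sum plus
    bias), followed by an activation function."""
    z = np.dot(w, x) + b          # transfer function
    return activation(z)          # activation function

# Example: 3 inputs x_1..x_3 with weights w_1j..w_3j.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, -0.4])
print(neuron_output(x, w, b=0.1))          # ReLU output
print(neuron_output(x, w, 0.1, sigmoid))   # sigmoid output
```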
2.1.2 Feedforward Neural Network
A Feedforward Neural Network (FNN) is an ANN wherein connections between
the nodes do not form a cycle. As such, it is different from its descendant: Recurrent
Neural Networks (RNNs).
The FNN was the first and simplest type of ANN devised. In this network, the
information moves in only one direction-forward-from the input nodes, through the
hidden nodes (if any) and to the output nodes. There are no cycles or loops in the
network. Figure 2.3 illustrates an FNN with 2 layers:

Figure 2.3: Typical structure of a feedforward neural network (source: https://uc-r.github.io/)

- An output layer with 1 output unit.
- A hidden layer with 5 units.
- The network has 4 input units. The 4 inputs are shown as green circles and do not belong to any layer of the network (although the inputs are sometimes considered a virtual layer, numbered 0).
- Any layer that is not an output layer is a hidden layer. This network therefore has 1 hidden layer and 1 output layer.
A Fully Connected Neural Network (FCN) consists of a series of fully connected layers. A fully connected layer is a function from $\mathbb{R}^m$ to $\mathbb{R}^n$ in which each output dimension depends on every input dimension.
For each input to the network, we expect a desired output, called the target, as opposed to the actual output (often called the prediction). The process of minimizing the difference between target and prediction is called the training (learning) process, and the difference is usually measured by a function called the loss function.
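To illustrate these ideas, here is a minimal sketch (not the thesis's implementation; weights and data are made up) of the 4-input, 5-hidden-unit, 1-output FNN of figure 2.3, with a mean squared error loss measuring the target/prediction difference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer weights for a 4-input, 5-hidden-unit, 1-output FNN.
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(1, 5)), np.zeros(1)

def forward(x):
    # Information flows one way: input -> hidden -> output.
    h = np.tanh(W1 @ x + b1)      # hidden layer
    return W2 @ h + b2            # output layer (the prediction)

def mse_loss(prediction, target):
    # The loss function measures the target/prediction difference.
    return np.mean((prediction - target) ** 2)

x, target = rng.normal(size=4), np.array([1.0])
print(mse_loss(forward(x), target))
```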
2.1.3 Convolutional Neural Network
Convolutional Neural Networks (CNNs) are very similar to the previously intro-
duced ANN: They are made up of neurons that have learnable weights and biases.
Each neuron receives some inputs, performs a dot product and optionally follows it
with a non-linearity. The whole network still expresses a single differentiable score
function: from the raw image pixels for example on one end to class scores at the
other.
So what changes? CNN architectures make the explicit assumption that the
inputs are images, which allows us to encode certain properties into the architecture.
These then make the forward function more efficient to implement and vastly reduce
the amount of parameters in the network. Specifically, input and output of each
layer are arranged in 3 dimensions: width, height, depth. Figure 2.4 shows the shape
of the volumes from the input to the output of a CNN. For example, the input of
a CNN (which is also the input of the first layer) can be an image from CIFAR-10
dataset which has dimensions 32x32x3 (width, height, depth respectively) and the
output of a CNN can be a 10-dimensional vector of 10 class scores. As we will soon
see, the neurons in a layer will only be connected to a small region of the layer
before it, instead of to all of the neurons in a fully-connected manner (the opposite of an FCN).
Figure 2.4: The shape of several neural network volumes (source: https://cs231n.github.io/)

Figure 2.5 depicts such a convolutional neural network.

Figure 2.5: Typical architecture of a convolutional neural network (source: https://www.mathworks.com/)
Unlike a normal ANN which only has a series of fully connected layers, a CNN
is made up of several different types of overlapping layers. The typical layers of a
CNN are categorized as follows:
A Convolutional layer
The convolutional layer is the core building block of a CNN that does most of the
computational heavy lifting (the convolution operation). The convolutional layer’s
parameters consist of a set of learnable filters (These filters together with the input
volume are the operands of the convolution operation). Every filter is small spatially
(along width and height), but extends through the full depth of the input volume.
For example, a typical filter on a first layer of a CNN might have size 5x5x3 (i.e.
5 pixels width and height, and 3 because images have depth 3, the color channels).
During the forward pass, we slide (more precisely, convolve) each filter across the
width and height of the input volume and compute dot products between the entries
of the filter and the input at any position. As we slide the filter over the width and
height of the input volume we will produce a 2-dimensional activation map that
gives the responses of that filter at every spatial position. Intuitively, the network
will learn filters that activate when they see some type of visual feature such as an
edge of some orientation or a blotch of some color on the first layer, or eventually
entire honeycomb or wheel-like patterns on higher layers of the network. Now, we
will have an entire set of filters in each convolutional layer (e.g. 12 filters), and each
of them will produce a separate 2-dimensional activation map. We will stack these
activation maps along the depth dimension and produce the output volume. Figure
2.6 and figure 2.7 depict such a process.
Figure 2.6: An activation map created by a pair of input volume and filter using the convolution operation (source: https://slideplayer.com/)
When dealing with high-dimensional inputs such as images, as we saw above it
is impractical to connect neurons to all neurons in the previous layer (We can see
that the neuron outputs that are spatially distant in the activation map have little
information relationship). Instead, we will connect each neuron to only a local region
of the input volume. The spatial extent of this connectivity is a hyperparameter
called the receptive field of the neuron (equivalently this is the filter size). The
extent of the connectivity along the depth axis is always equal to the depth of the
input volume. It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: the connections are local in space (along width and height), but always full along the entire depth of the input volume.

Figure 2.7: If we have 6 5x5x3 filters, we'll get an output volume of 6 separate activation maps (source: https://slideplayer.com/)
We have explained the connectivity of each neuron in the convolutional layer to
the input volume, but we haven’t yet discussed how many neurons there are in the
output volume or how they are arranged. Three hyperparameters control the size
of the output volume: the depth, stride and zero-padding. We discuss these next:
- First, the depth of the output volume is a hyperparameter: it corresponds to the number of filters we would like to use, each learning to look for something different in the input. For example, if the first convolutional layer takes the raw image as input, then different neurons along the depth dimension may activate in the presence of various oriented edges or blobs of color. We will refer to a set of neurons that are all looking at the same region of the input as a depth column (some people also prefer the term fibre). Figure 2.8 depicts one such depth column with depth = 5.
- Second, we must specify the stride with which we slide the filter. When the stride is 1, we move the filters one pixel at a time. When the stride is 2 (or, uncommonly, 3 or more, though this is rare in practice), the filters jump 2 pixels at a time as we slide them around. This produces smaller output volumes spatially.
- Third, as we will soon see, sometimes it is convenient to pad the input volume with zeros around the border. The size of this zero-padding is a hyperparameter. The nice feature of zero padding is that it allows us to control the spatial size of the output volumes (most commonly, as we will see soon, we use it to exactly preserve the spatial size of the input volume, so the input and output width and height are the same). For easier understanding, refer to figure 2.9.
We can compute the spatial size of the output volume as a function of the input volume size $W$, the receptive field size of the convolutional layer neurons $F$, the stride with which they are applied $S$, and the amount of zero padding used on the border $P$. You can convince yourself that the correct formula for calculating how many neurons "fit" is given by $(W - F + 2P)/S + 1$. For example, for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output; with stride 2 we would get a 3x3 output.

Figure 2.8: Depth column (source: https://cs231n.github.io/)

Figure 2.9: How zero-padding affects the spatial size of the output volume (source: https://www.researchgate.net/)
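The formula is easy to check in code; a small helper (illustrative, not from the thesis) reproduces the worked example above:

```python
def conv_output_size(W, F, S=1, P=0):
    """Number of neurons that 'fit': (W - F + 2P)/S + 1."""
    size, rem = divmod(W - F + 2 * P, S)
    assert rem == 0, "filter does not tile the input evenly"
    return size + 1

print(conv_output_size(W=7, F=3, S=1, P=0))  # 5 -> 5x5 output
print(conv_output_size(W=7, F=3, S=2, P=0))  # 3 -> 3x3 output
```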
B Pooling layer
It is common to periodically insert a pooling layer in-between successive convo-
lutional layers in a CNN architecture. Its function is to progressively reduce the
spatial size of the representation to reduce the amount of parameters and computa-
tion in the network, and hence to also control overfitting. The pooling layer operates
independently on every depth slice of the input and resizes it spatially, using the
MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation in this case takes a max over 4 numbers (a little 2x2 region in some depth slice). The depth dimension remains unchanged. More generally, the pooling layer:
- accepts a volume of size $W_1 \times H_1 \times D_1$,
- requires two hyperparameters: the spatial extent $F$ and the stride $S$,
- produces a volume of size $W_2 \times H_2 \times D_2$, where $W_2 = (W_1 - F)/S + 1$, $H_2 = (H_1 - F)/S + 1$, and $D_2 = D_1$,
- introduces zero parameters, since it computes a fixed function of the input.
For pooling layers, it is not common to pad the input using zero-padding.
It is worth noting that there are only two commonly seen variations of the max
pooling layer found in practice: a pooling layer with F = 3, S = 2 (also called
overlapping pooling), and more commonly F = 2, S = 2. Pooling sizes with larger
receptive fields are too destructive.
In addition to max pooling, the pooling units can also perform other functions,
such as average pooling or even L2-norm pooling. Average pooling was often used
historically but has recently fallen out of favor compared to the max pooling oper-
ation, which has been shown to work better in practice.
Figure 2.10: Max pooling layer (source: https://computersciencewiki.org/)
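A minimal NumPy sketch of the common F = 2, S = 2 case described above (illustrative; it assumes the input width and height are even):

```python
import numpy as np

def max_pool_2x2(volume):
    """Max pooling with F=2, S=2 on a (W, H, D) volume.
    Each depth slice is downsampled by 2 along width and height;
    the depth dimension is unchanged."""
    W, H, D = volume.shape
    pooled = volume.reshape(W // 2, 2, H // 2, 2, D)
    return pooled.max(axis=(1, 3))   # max over each little 2x2 region

x = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
print(max_pool_2x2(x).shape)  # (2, 2, 3): 75% of activations discarded
```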
C Normalization layer
Many types of normalization layers have been proposed for use in CNN architec-
tures, sometimes with the intentions of implementing inhibition schemes observed
in the biological brain. However, these layers have since fallen out of favor because
in practice their contribution has been shown to be minimal, if any.
D Fully-connected layer
See section 2.1.2 for more information.
2.1.4 Recurrent Neural Network
RNNs are a class of ANNs that allow previous outputs to be used as inputs while maintaining hidden states. Their typical structure is shown in figure 2.11.
For each timestep $t$, the activation $a^{<t>}$ and the output $y^{<t>}$ are expressed as follows:

$$a^{<t>} = g_1(W_{aa} a^{<t-1>} + W_{ax} x^{<t>} + b_a) \qquad (2.1)$$

$$y^{<t>} = g_2(W_{ya} a^{<t>} + b_y) \qquad (2.2)$$

where $W_{ax}, W_{aa}, W_{ya}, b_a, b_y$ are coefficients that are shared temporally, and $g_1$, $g_2$ are activation functions.
Figure 2.11: The architecture of a recurrent neural network (source: https://stanford.edu/)

Figure 2.12: Neuron structure of a recurrent neural network (source: https://stanford.edu/)
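As a concrete illustration of equations (2.1) and (2.2), here is a minimal sketch of a single recurrent step (dimensions and weights are illustrative; tanh and softmax stand in for $g_1$ and $g_2$):

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, n_y = 3, 5, 2          # input, hidden, output sizes

# Coefficients shared across all timesteps.
Waa, Wax = rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x))
Wya = rng.normal(size=(n_y, n_a))
ba, by = np.zeros(n_a), np.zeros(n_y)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(a_prev, x_t):
    # Equation (2.1): new activation from previous activation and input.
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    # Equation (2.2): output from the new activation.
    y_t = softmax(Wya @ a_t + by)
    return a_t, y_t

a = np.zeros(n_a)
for x_t in rng.normal(size=(4, n_x)):   # a length-4 input sequence
    a, y = rnn_step(a, x_t)
print(y)
```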
The advantages of a typical RNN are:
- computation takes into account historical information,
- weights are shared across time,
- the possibility of processing input of any length,
- the model size does not increase with the size of the input.
The drawbacks of a typical RNN are:
- computation is slow,
- it cannot consider any future input for the current state,
- it has difficulty accessing information from a long time ago.
RNN models are mostly used in the fields of natural language processing, music generation, sentiment classification, named entity recognition, machine translation, etc.
In the case of an RNN, the loss function $\mathcal{L}$ over all time steps is defined based on the loss at every time step as follows:

$$\mathcal{L}(\hat{y}, y) = \sum_{t=1}^{T_y} \mathcal{L}(\hat{y}^{<t>}, y^{<t>}) \qquad (2.3)$$
Backpropagation is done at each point in time. At timestep $T$, the derivative of the loss $\mathcal{L}$ with respect to the weight matrix $W$ is expressed as follows:

$$\frac{\partial \mathcal{L}^{(T)}}{\partial W} = \sum_{t=1}^{T} \left.\frac{\partial \mathcal{L}^{(T)}}{\partial W}\right|_{(t)} \qquad (2.4)$$
The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. They happen because the multiplicative gradients can decrease or increase exponentially with respect to the number of layers, which makes it difficult to capture long-term dependencies.
Gradient clipping is a technique used to cope with the exploding gradient problem
sometimes encountered when performing backpropagation. By capping the maxi-
mum value for the gradient, this phenomenon is controlled in practice.
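A minimal sketch of clipping by global norm (an illustrative NumPy version; deep learning frameworks ship equivalent utilities):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm
    does not exceed max_norm, capping exploding gradients."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.full(3, 100.0), np.full(2, -50.0)]   # exploded gradients
clipped = clip_gradients(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # <= 5.0
```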
In order to remedy the vanishing gradient problem, specific gates are used in some types of RNNs, usually with a well-defined purpose. They are usually denoted $\Gamma$ and are equal to:

$$\Gamma = \sigma(W x^{<t>} + U a^{<t-1>} + b) \qquad (2.5)$$

where $W$, $U$, $b$ are coefficients specific to the gate and $\sigma$ is the sigmoid function.
The main ones are summed up in table 2.1.

Table 2.1: Different types of gates of an RNN

Type of gate            Role                                Used in
Update gate Γu          How much past should matter now?    GRU, LSTM
Relevance gate Γr       Drop previous information?          GRU, LSTM
Forget gate Γf          Erase a cell or not?                LSTM
Output gate Γo          How much to reveal of a cell?       LSTM
Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) units deal with the vanishing gradient problem encountered by traditional RNNs, with LSTM being a generalization of GRU. Table 2.2 sums up the characterizing equations of each architecture, and table 2.3 sums up the other commonly used RNN architectures.

Table 2.2: GRU and LSTM (source: https://stanford.edu/)

Table 2.3: Variants of RNNs: Bidirectional (BRNN) and Deep (DRNN) (source: https://stanford.edu/)
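Since tables 2.2 and 2.3 are reproduced as images, a minimal GRU cell sketch may help make the gates concrete. It follows the standard GRU formulation with an update gate Γu and a relevance gate Γr of the form of equation (2.5); all dimensions and weight names are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a = 3, 4
# One (W, U, b) triple per gate, plus one for the candidate state.
Wu, Uu, bu = rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a)), np.zeros(n_a)
Wr, Ur, br = rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a)), np.zeros(n_a)
Wc, Uc, bc = rng.normal(size=(n_a, n_x)), rng.normal(size=(n_a, n_a)), np.zeros(n_a)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(a_prev, x_t):
    gamma_u = sigmoid(Wu @ x_t + Uu @ a_prev + bu)   # update gate (eq. 2.5 form)
    gamma_r = sigmoid(Wr @ x_t + Ur @ a_prev + br)   # relevance gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ (gamma_r * a_prev) + bc)  # candidate state
    return gamma_u * c_tilde + (1.0 - gamma_u) * a_prev  # mix new and old state

a = np.zeros(n_a)
for x_t in rng.normal(size=(5, n_x)):
    a = gru_step(a, x_t)
print(a)
```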
2.2 Region of Interest pooling
Region of Interest (RoI) pooling is an operation widely used in object detection tasks with CNNs. For example, it can be used to detect multiple cars and pedestrians in a single image. In a typical case, its purpose is to perform max pooling on inputs of nonuniform sizes to obtain fixed-size feature maps (e.g., 7x7).
Two major tasks in computer vision are object classification and object detection.
In the first case, the system is supposed to correctly label the dominant object in
an image. In the second case, it should provide correct labels and locations for all
objects in an image. Of course, there are other interesting areas of computer vision,
such as image segmentation, but in this section, we’re going to focus on detection.
In this task, we’re usually supposed to draw bounding boxes around any object
from a previously specified set of categories and assign a class to each of them. For
example, let's say we're developing an algorithm for self-driving cars and we'd like to use a camera to detect other cars, pedestrians, cyclists, etc. In this case, we'd have to draw a box around every significant object
and assign a class to it. This task is more challenging than classification tasks such
as MNIST or CIFAR. On each frame of the video, there might be multiple objects,
some of them overlapping, some poorly visible, or occluded. Moreover, for such an
algorithm, performance can be a key issue. In particular, for autonomous driving,
we have to process tens of frames per second. So how do we solve this problem?
The object detection architecture we’re going to be talking about in this section
is broken down into two stages:
- Region proposal: Given an input image, its purpose is to find all possible places where objects can be located. The output of this stage should be a list of bounding boxes of likely positions of objects. These are often called region proposals or RoIs. There are quite a few methods for this task, but we're not going to talk about them in this section.
- Final classification: For every region proposal from the previous stage, decide whether it belongs to one of the target classes or to the background. Here we could use a deep CNN.
Figure 2.13: RoI pooling (source: https://deepsense.ai/)
Usually in the proposal phase we have to generate a lot of RoIs. Why? If an
object is not detected during the first stage (region proposal), there’s no way to
correctly classify it in the second phase. That’s why it’s extremely important for
the region proposals to have a high recall. And that’s achieved by generating very
large numbers of proposals (e.g., a few thousand per frame). Most of them will be
classified as background in the second stage of the detection algorithm.
Some problems with this architecture are:
- Generating a large number of RoIs can lead to performance problems. This would make real-time object detection difficult to implement.
- It's suboptimal in terms of processing speed.
- You can't do end-to-end training, i.e., you can't train all the components of the system in one run (which would yield much better results).
2.2.1 Conventional RoI pooling
RoI pooling is an ANN layer used for object detection tasks. It was first proposed by Ross Girshick in April 2015, and it achieves a significant speedup of both training and testing while maintaining a high detection accuracy. The layer takes two inputs:
- a fixed-size feature map obtained from a deep CNN with several convolutional layers and max pooling layers;
- an N × 5 matrix representing a list of RoIs, where N is the number of RoIs. The first column represents the image index and the remaining four are the coordinates of the top left and bottom right corners of the RoI.
What does RoI pooling actually do? For every RoI from the input list, it takes the section of the input feature map that corresponds to it and scales it to some pre-defined size (e.g., 7x7). The scaling is done by:
- dividing the region proposal into equal-sized sections (the number of which is the same as the dimension of the output),
- finding the largest value in each section,
- copying these max values to the output buffer.
The result is that from a list of rectangles with different sizes we can quickly
get a list of corresponding feature maps with a fixed size. Note that the dimension
of the RoI pooling output doesn’t actually depend on the size of the input feature
map nor on the size of the region proposals. It’s determined solely by the number
of sections we divide the proposal into. What are the benefits of RoI pooling? One of
them is processing speed. If there are multiple object proposals on the frame (and
usually there’ll be a lot of them), we can still use the same input feature map for
all of them. Since computing the convolutions at early stages of processing is very
expensive, this approach can save us a lot of time.
Let's consider a small example to see how it works. We're going to perform RoI pooling on a single 8x8 feature map, one RoI, and an output size of 2x2. Our input feature map looks like figure 2.14.
Let's say we also have a region proposal (top left, bottom right coordinates): (0, 3), (7, 8). In the picture it would look like figure 2.15.
Normally, there'd be multiple feature maps and multiple proposals for each of them, but we're keeping things simple for the example. By dividing the RoI into 2x2 sections (because the output size is 2x2) we get the result shown in figure 2.16.
Notice that the size of the RoI doesn't have to be perfectly divisible by the number of pooling sections (in this case our RoI is 7x5 and we have 2x2 pooling sections). The max values in each of the sections are shown in figure 2.17, and that's the output from the RoI pooling layer.
Figure 2.14: Example of a feature map (source: https://deepsense.ai/)

Figure 2.15: Example of a region proposal (source: https://deepsense.ai/)

Figure 2.16: 2x2 pooling sections (source: https://deepsense.ai/)

Figure 2.17: Pooled feature map (source: https://deepsense.ai/)
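The walkthrough above can be captured in a few lines. Below is a minimal NumPy sketch (illustrative, not the thesis's implementation) that reproduces the 8x8 example with a 2x2 output, including the uneven 7x5 split:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one RoI (x0, y0, x1, y1) of a 2D feature map into a
    fixed output_size grid, as in the 8x8 example above."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    out_h, out_w = output_size
    # Section boundaries; sections need not divide the RoI evenly.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fm = np.arange(64, dtype=float).reshape(8, 8)
print(roi_pool(fm, roi=(0, 3, 7, 8)))  # RoI spanning a 7-wide, 5-tall region
```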
2.2.2 RoI Align
The main difference between RoI pooling and RoI Align is quantization: RoI Align does not use quantization for data pooling. Fast R-CNN applies quantization twice: the first time in the mapping process and the second time during the pooling process. Figure 2.18 depicts the quantization of RoI pooling.

Figure 2.18: The quantization of RoI pooling (source: https://qiita.com/)

These quantizations introduce misalignments between the RoI and the extracted features. This may not impact detection/classification, which is robust to small perturbations, but it has a large negative effect on predicting pixel-accurate masks. To address this, RoI Align was proposed, which removes any quantization operations. Instead, bilinear interpolation is used to compute the exact values for every proposal.
Similar to RoI pooling, the proposal is divided into a pre-fixed number of smaller regions. Within each smaller region, 4 points are sampled. The feature value for each sampled point is computed with bilinear interpolation (see figure 2.19). A max or average operation is then carried out to get the final output (see figure 2.20).
2.2.3 Other popular RoI pooling techniques
To better fit text regions, varying-size RoI pooling [2] has been developed to keep the aspect ratio unchanged for text recognition.

Figure 2.19: RoI Align (source: https://developpaper.com/)

Figure 2.20: RoI Align uses a max/average operation to get the output for each smaller region (source: https://qiita.com/)

Moreover, to handle
oriented text regions, RoI-Rotate [3] has been developed to address this issue using an affine transformation, motivated by the idea of the Spatial Transformer Network (STN) [4], which learns a transformation matrix. By contrast, for quadrangle region proposals, TextNet's authors developed the perspective RoI transform [5] to convert an arbitrary-size quadrangle into a small variable-width, fixed-height feature map, which can be regarded as a generalization of the existing methods. Specifically, the perspective RoI transform warps each RoI by a perspective transformation and bilinear sampling.
Figure 2.21: Object detection (source: https://viblo.asia/)
2.3 Detection and Segmentation
2.3.1 Detection
To gain a complete understanding of an image, we should not only classify different images but also try to precisely estimate the concepts and locations of the objects contained in each image. This task is referred to as object detection, and it usually consists of different subtasks such as face detection and pedestrian detection. As one of the fundamental computer vision problems, object detection provides valuable information for the semantic understanding of images and videos, and is related to many applications, including image classification, human behavior analysis, face recognition, and autonomous driving. Meanwhile, since object detectors can be considered learning systems inheriting from ANNs and related learning systems, progress in those fields also has a great impact on object detection techniques. However, due to large variations in viewpoints, poses, occlusions, and lighting conditions, it is difficult to accomplish object detection perfectly, on top of the additional object localization task, so much attention has been attracted to this field in recent years.
The problem definition of object detection is to determine where objects are
located in a given image (object localization) and which category each object belongs
to (object classification). So the pipeline of traditional object detection models can
be mainly divided into three stages: informative region selection, feature extraction
and classification.
- Informative region selection: As different objects may appear at any position in the image and have different aspect ratios or sizes, it is natural to scan the whole image with a multi-scale sliding window. Although this exhaustive strategy can find all possible positions of the objects, its shortcomings are also obvious: due to the large number of candidate windows, it is computationally expensive and produces too many redundant windows. However, if only a fixed number of sliding window templates are applied, unsatisfactory regions may be produced.
- Feature extraction: To recognize different objects, we need to extract visual features which provide a semantic and robust representation. SIFT, HOG, and Haar-like features are representative ones, because such features can produce representations associated with complex cells in the human brain. However, due to the diversity of appearances, illumination conditions, and backgrounds, it is difficult to manually design a robust feature descriptor that perfectly describes all kinds of objects.
- Classification: A classifier is needed to distinguish a target object from all the other categories and to make the representations more hierarchical, semantic, and informative for visual recognition. The Support Vector Machine (SVM), AdaBoost, and the Deformable Part-based Model (DPM) are common choices. Among these classifiers, the DPM is a flexible model that combines object parts with a deformation cost to handle severe deformations. In the DPM, with the aid of a graphical model, carefully designed low-level features and kinematically inspired part decompositions are combined, and discriminative learning of graphical models allows building high-precision part-based models for a variety of object classes.
Based on these discriminant local feature descriptors and shallow learnable architectures, state-of-the-art results were obtained in the PASCAL VOC object detection competition, and real-time embedded systems were built with a low burden on hardware. However, only small gains were obtained during 2010-2012, by building ensemble systems and employing minor variants of successful methods. This is due to the following reasons:
- The generation of candidate bounding boxes with a sliding window strategy is redundant, inefficient, and inaccurate.
- The semantic gap cannot be bridged by the combination of manually engineered low-level descriptors and discriminatively-trained shallow models.
Thanks to the emergence of Deep Neural Networks (DNNs), a more significant gain was obtained with the introduction of Regions with CNN features (R-CNN). DNNs, or the most representative CNNs, act in a quite different way from traditional approaches. They have deeper architectures with the capacity to learn more complex features than the shallow ones. Their expressivity and robust training algorithms also allow learning informative object representations without the need to design features manually.
Figure 2.22: Comparison between shallow and deep learning (source: https://datascience.stackexchange.com/)
2.3.2 Segmentation
Image segmentation is the process of partitioning a digital image into multiple
segments (sets of pixels, also known as image objects). The goal of segmentation
is to simplify and/or change the representation of an image into something that is
more meaningful and easier to analyze. Image segmentation is typically used to
locate objects and boundaries (lines, curves, etc.) in images. More precisely, image
segmentation is the process of assigning a label to every pixel in an image such that
pixels with the same label share certain characteristics.
Figure 2.23: Segmentation types (source: https://www.pyimagesearch.com/)

Figure 2.24: Faster R-CNN (object detection) and Mask R-CNN (instance segmentation). For instance segmentation, we need one more output branch (a binary mask branch) to categorize each pixel into its corresponding class (source: https://lilianweng.github.io/)
2.4 Sampling and Interpolation
Image interpolation occurs in all digital photos at some stage, whether in Bayer demosaicing or in photo enlargement. It happens anytime you resize or remap (distort) your image from one pixel grid to another.
you need to increase or decrease the total number of pixels, whereas remapping can
occur under a wider variety of scenarios: correcting for lens distortion, changing
perspective, and rotating an image.
Figure 2.25: Image interpolation (source: https://www.cambridgeincolour.com/)
Even if the same image resize or remap is performed, the results can vary significantly depending on the interpolation algorithm. Interpolation is only an approximation, so an image will always lose some quality each time interpolation is performed. This section aims to provide a better understanding of how the results may vary, helping to minimize any interpolation-induced losses in image quality.
Interpolation works by using known data to estimate values at unknown points.
For example, if you wanted to know the temperature at noon, but only measured it at 11 AM and 1 PM, you could estimate its value by performing a linear interpolation, as in figure 2.26.

Figure 2.26: Linear interpolation (source: https://www.cambridgeincolour.com/)
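The temperature example amounts to a one-line computation; a minimal sketch with made-up temperature values:

```python
def lerp(t0, v0, t1, v1, t):
    # Linearly interpolate the value at time t from (t0, v0) and (t1, v1).
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Measured 20°C at 11 AM and 26°C at 1 PM -> estimate 23°C at noon.
print(lerp(11.0, 20.0, 13.0, 26.0, 12.0))  # 23.0
```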
If you had an additional measurement at 11:30 AM, you could see that the bulk of the temperature rise occurred before noon, and you could use this additional data point to perform a quadratic interpolation, as in figure 2.27.
The more temperature measurements you have which are close to noon, the more
sophisticated (and hopefully more accurate) your interpolation algorithm can be.
Image interpolation works in two directions, and tries to achieve a best approx-
imation of a pixel’s color and intensity based on the values at surrounding pixels.
The figure 2.28 illustrates how resizing / enlargement works.
Unlike air temperature fluctuations and the ideal gradient above, pixel values can
change far more abruptly from one location to the next. As with the temperature
example, the more you know about the surrounding pixels, the better the interpo-
lation will become. Therefore, results quickly deteriorate the more you stretch an image, and interpolation can never add detail to your image which is not already present.

Figure 2.27: Quadratic interpolation (source: https://www.cambridgeincolour.com/)

Figure 2.28: How interpolation works in a 2D image (source: https://www.cambridgeincolour.com/)
Interpolation also occurs each time you rotate or distort an image. The previous
example was misleading because it is one which interpolators are particularly good
at. This next example shows how image detail can be lost quite rapidly (see figure
2.29).
Figure 2.29: How interpolation works in a 2D image (source: https://www.cambridgeincolour.com/)
The 90° rotation is lossless because no pixel ever has to be repositioned onto the border between two pixels (and therefore divided). Note how most of the detail is lost in just the first rotation, although the image continues to deteriorate with successive rotations. One should therefore avoid rotating photos when possible; if an unleveled photo requires it, rotate no more than once.
The above results use what is called a "bicubic" algorithm, and show signifi-
cant deterioration. Note the overall decrease in contrast evident by color becoming
less intense, and how dark haloes are created around the light blue. The above re-
sults could be improved significantly, depending on the interpolation algorithm and
subject matter.
Common interpolation algorithms can be grouped into two categories: adaptive
and non-adaptive. Adaptive methods change depending on what they are interpo-
lating (sharp edges vs. smooth texture), whereas non-adaptive methods treat all
pixels equally.
Non-adaptive algorithms include: nearest neighbor, bilinear, bicubic, spline, sinc, Lanczos, and others. Depending on their complexity, these use anywhere from 0 to
256 (or more) adjacent pixels when interpolating. The more adjacent pixels they
include, the more accurate they can become, but this comes at the expense of much
longer processing time. These algorithms can be used to both distort and resize a
photo.
Adaptive algorithms include many proprietary algorithms in licensed software
such as: Qimage, PhotoZoom Pro, Genuine Fractals and others. Many of these
apply a different version of their algorithm (on a pixel-by-pixel basis) when they
detect the presence of an edge - aiming to minimize unsightly interpolation artifacts
in regions where they are most apparent. These algorithms are primarily designed to
maximize artifact-free detail in enlarged photos, so some cannot be used to distort
or rotate an image.
Nearest neighbor is the most basic and requires the least processing time of all
the interpolation algorithms because it only considers one pixel - the closest one to
the interpolated point. This has the effect of simply making each pixel bigger.
Bilinear interpolation considers the closest 2x2 neighborhood of known pixel
values surrounding the unknown pixel. It then takes a weighted average of these 4
pixels to arrive at its final interpolated value. This results in much smoother looking
images than nearest neighbor (see figure 2.30).
Figure 2.30: Bilinear interpolation (source: https://www.cambridgeincolour.com/)
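A minimal sketch of the computation just described (illustrative code with a made-up 2x2 image): bilinear interpolation at a real-valued point is a proximity-weighted average of the four surrounding pixels. This is also the sampling step that RoI Align from section 2.2.2 relies on.

```python
import numpy as np

def bilinear(img, x, y):
    """Interpolate img at real-valued (x, y) from the closest 2x2
    neighborhood of known pixels, weighted by proximity."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    dx, dy = x - x0, y - y0
    return (img[y0, x0] * (1 - dx) * (1 - dy) +
            img[y0, x1] * dx * (1 - dy) +
            img[y1, x0] * (1 - dx) * dy +
            img[y1, x1] * dx * dy)

img = np.array([[0.0, 10.0],
                [20.0, 30.0]])
print(bilinear(img, 0.5, 0.5))  # 15.0: weighted average of the 4 pixels
```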
There are many other interpolators which take more surrounding pixels into
consideration, and are thus also much more computationally intensive. These al-
gorithms include spline and sinc, and retain the most image information after an
interpolation. They are therefore extremely useful when the image requires multiple
rotations / distortions in separate steps. However, for single-step enlargements or
rotations, these higher-order algorithms provide diminishing visual improvement as
processing time is increased.
2.5 Training and Inference
2.5.1 Training
Once a network has been structured for a particular application, that network is
ready to be trained. To start this process the initial weights are chosen randomly.
Then, the training, or learning, begins.
There are two approaches to training - supervised and unsupervised. Supervised
training involves a mechanism of providing the network with the desired output
either by manually "grading" the network’s performance or by providing the desired
outputs with the inputs. Unsupervised training is where the network has to make
sense of the inputs without outside help.
The vast bulk of networks utilize supervised training. Unsupervised training is
used to perform some initial characterization on inputs. However, in the full blown
sense of being truly self learning, it is still just a shining promise that is not fully
understood, does not completely work, and thus is relegated to the lab.
A Supervised training
In supervised training, both the inputs and the outputs are provided. The net-
work then processes the inputs and compares its resulting outputs against the desired
outputs. Errors are then propagated back through the system, causing the system
to adjust the weights which control the network. This process occurs over and over
as the weights are continually tweaked. The set of data which enables the training is called the "training set." During the training of a network, the same set of data is processed many times as the connection weights are continually refined.
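As a minimal illustration of this loop (a single linear neuron trained by gradient descent on synthetic data; all names and values are made up for the example), the cycle of processing inputs, comparing against desired outputs, and adjusting weights looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # training set inputs
true_w = np.array([2.0, -1.0, 0.5])
targets = X @ true_w                     # desired outputs

w = rng.normal(size=3)                   # initial weights chosen randomly
lr = 0.1
for epoch in range(200):                 # same data processed many times
    predictions = X @ w
    errors = predictions - targets       # compare outputs to desired outputs
    w -= lr * X.T @ errors / len(X)      # propagate errors back, adjust weights

print(np.round(w, 3))                    # approaches [2.0, -1.0, 0.5]
```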
The current commercial network development packages provide tools to monitor
how well an artificial neural network is converging on the ability to predict the right
answer. These tools allow the training process to go on for days, stopping only
when the system reaches some statistically desired point, or accuracy. However,
some networks never learn. This could be because the input data does not contain
the specific information from which the desired output is derived. Networks also
don’t converge if there is not enough data to enable complete learning. Ideally,
there should be enough data so that part of the data can be held back as a test.
Many layered networks with multiple nodes are capable of memorizing data. To
monitor the network to determine if the system is simply memorizing its data in
some nonsignificant way, supervised training needs to hold back a set of data to be
used to test the system after it has undergone its training. (Note: Memorization is
avoided by not having too many processing elements.)
If a network simply can’t solve the problem, the designer then has to review
the input and outputs, the number of layers, the number of elements per layer, the
connections between the layers, the summation, transfer, and training functions, and
even the initial weights themselves. Those changes required to create a successful
network constitute a process wherein the "art" of neural networking occurs.
Another part of the designer’s creativity governs the rules of training. There
are many laws (algorithms) used to implement the adaptive feedback required to
adjust the weights during training. The most common technique is backward-error
propagation, more commonly known as back-propagation. These various learning
techniques are explored in greater depth later in this report.
Yet, training is not just a technique. It involves a "feel," and conscious analysis,
to insure that the network is not overtrained. Initially, an artificial neural network
configures itself with the general statistical trends of the data. Later, it continues
to "learn" about other aspects of the data which may be spurious from a general
viewpoint.
When finally the system has been correctly trained, and no further learning
is needed, the weights can, if desired, be "frozen." In some systems this finalized
network is then turned into hardware so that it can be fast. Other systems don’t
lock themselves in but continue to learn while in production use.
B Unsupervised, or adaptive training
The other type of training is called unsupervised training. In unsupervised train-
ing, the network is provided with inputs but not with desired outputs. The system
itself must then decide what features it will use to group the input data. This is often referred to as self-organization or adaptation.
At the present time, unsupervised learning is not well understood. This adaptation to the environment is the promise that would enable science-fiction types of robots
to continually learn on their own as they encounter new situations and new envi-
ronments. Life is filled with situations where exact training sets do not exist. Some
of these situations involve military action where new combat techniques and new
weapons might be encountered. Because of this unexpected aspect to life and the
human desire to be prepared, there continues to be research into, and hope for, this
field. Yet, at the present time, the vast bulk of neural network work is in systems
with supervised learning. Supervised learning is achieving results.
One of the leading researchers into unsupervised learning is Teuvo Kohonen, an
electrical engineer at the Helsinki University of Technology. He has developed a
self-organizing network, sometimes called an auto-associator, that learns without
the benefit of knowing the right answer. It is an unusual looking network in that it
contains one single layer with many connections. The weights for those connections
have to be initialized and the inputs have to be normalized. The neurons are set up
to compete in a winner-take-all fashion.
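A minimal sketch of such a winner-take-all update, assuming normalized inputs and a single layer of weight vectors (the dimensions and learning rate below are illustrative):

import numpy as np

# Winner-take-all (Kohonen-style) update: one single layer, normalized
# inputs, and no "right answer" used anywhere (illustrative sketch only).
rng = np.random.default_rng(0)
n_units, n_features, lr = 8, 16, 0.1
W = rng.normal(size=(n_units, n_features))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # weights must be initialized

for _ in range(1000):
    x = rng.normal(size=n_features)
    x /= np.linalg.norm(x)                      # inputs must be normalized
    winner = np.argmax(W @ x)                   # neurons compete; best match wins
    W[winner] += lr * (x - W[winner])           # only the winner moves toward x
    W[winner] /= np.linalg.norm(W[winner])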
Kohonen continues his research into networks that are structured differently than
standard, feedforward, back-propagation approaches. Kohonen’s work deals with the
grouping of neurons into fields. Neurons within a field are "topologically ordered."
Topology is a branch of mathematics that studies how to map from one space to an-
other without changing the geometric configuration. The three-dimensional group-
ings often found in mammalian brains are an example of topological ordering.
Kohonen has pointed out that the lack of topology in neural network models makes today's neural networks just simple abstractions of the real neural networks within
the brain. As this research continues, more powerful self learning networks may
become possible. But currently, this field remains one that is still in the laboratory.
2.5.2 Inference
The trained neural network is put to work out in the digital world using what
it has learned - to recognize images, spoken words, a blood disease, or suggest the
shoes someone is likely to buy next, you name it - in the streamlined form of an
application. This speedier and more efficient version of a neural network infers
things about new data it’s presented with based on its training. In the AI lexicon
this is known as "inference".
Inference can’t happen without training. Makes sense. That’s how we gain and
use our own knowledge for the most part. And just as we don’t haul around all
our teachers, a few overloaded bookshelves and a red-brick schoolhouse to read a
Shakespeare sonnet, inference doesn’t require all the infrastructure of its training
regimen to do its job well.
So let's break down the progression from training to inference, and how they both function in the context of AI. A trained network is typically simplified before deployment, and two approaches are common. The first looks at parts of the neural network that don't get activated after it's trained; these sections just aren't needed and can be "pruned" away. The second looks for ways to fuse multiple layers of the neural network into a single computational step.
It's akin to the compression applied to a digital image. Designers might work on huge, beautiful, million-pixel-wide-and-tall images, but when they put them online, they turn them into a JPEG: almost exactly the same, indistinguishable to the human eye, but at a smaller resolution. Similarly, with inference
you’ll get almost the same accuracy of the prediction, but simplified, compressed
and optimized for runtime performance.
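One standard instance of layer fusion is folding a batch normalization layer into the preceding convolution. The sketch below shows the arithmetic; it is a well-known identity, not specific to any model in this document:

import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an (eval-mode) BatchNorm2d into the preceding Conv2d."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    # Per output channel: scale = gamma / sqrt(running_var + eps)
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros_like(scale)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused

conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.eval()  # inference mode: BN uses its frozen running statistics
x = torch.randn(1, 3, 32, 32)
fused = fuse_conv_bn(conv, bn)
print(torch.allclose(bn(conv(x)), fused(x), atol=1e-5))  # True: same output, one layer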
What that means is we all use inference all the time. Your smartphone’s voice-
activated assistant uses inference, as does Google’s speech recognition, image search
and spam filtering applications. Baidu also uses inference for speech recognition,
malware detection and spam filtering. Facebook’s image recognition and Amazon’s
and Netflix’s recommendation engines all rely on inference.
GPUs, thanks to their parallel computing capabilities (the ability to do many things at once), are good at both training and inference.
After training is completed, the networks are deployed into the field for "inference": classifying data to "infer" a result. Here too, GPUs and their parallel computing capabilities offer benefits, as they run billions of computations based on the trained network to identify known patterns or objects.
Chapter 3
Related Work
In the last chapter, we skimmed through some of the most fundamental concepts of deep learning, image processing, and computer vision. That provides us with the basic knowledge needed for the relevant topics in this chapter: Chinese and Han Nom scene text detection and recognition methods. On a side note, the reviewed methods and related arguments concern those that apply artificial intelligence and machine learning, because of their applicability and effectiveness in the field of computer vision.
Basically, scene text spotting methods can be classified into the following three groups based on their architecture:
Architecture 1: Detection and recognition are two separate models in terms of the backward path. Specifically, an image first goes through the detection model, and all detected regions, either text lines of characters (for textline detection) or sets of character regions (for character detection), are cropped from the original image. Those cropped images are then post-processed by techniques such as image rectification, vertical textline rotation, or image augmentation before going through the recognition model. The recognition model then classifies each character into its corresponding class. The backward path is blocked by the architecture between the detection model and the recognition model: because of the very limited support of high-level and low-level Application Programming Interfaces (APIs) in providing utility functions, designing a path for the gradient to flow back from the input of the recognition model to the output of the detection model is still not feasible. Figure 3.1 depicts this type of architecture. Some of the popular models with this architecture are:
Detection models: TextBoxes(++) [6], SegLink [7], CTPN [8], EAST [9],
PMTD [10], OBD [11], etc.
Recognition models: CRNN [12], RARE [13], etc.
Architecture 2: In this architecture, detection and recognition are two components (branches) of the same model. They are neurons and layers built on top of an ANN called the backbone network, whose mission is to extract high-level abstract features for both the detection and recognition tasks. Therefore, gradients from both the detection and recognition branches can flow back to the backbone network, and the training process benefits from this. The key design task is the branching point from the backbone network to the detection and recognition branches. RoI pooling is the popular method used to connect the recognition model to the detection model in RNN-based recognition methods. For CNN-based recognition models, normal convolutional layers are used for this task. Figure 3.2 depicts this type of architecture. Some of the popular methods which make use of this architecture are:
CNN-based methods: CharNet [14], etc.
RNN-based methods: FOTS [3], Mask TextSpotter [15], Textdragon, etc.
Architecture 3: Although the second architecture brings many improvements, there are still limitations that affect the ability of the model:
Firstly, branching splits detection and recognition into two separate tasks (the two branches do not share any learnable parameters), but they are in fact two closely-related tasks: efficiency in the detection process increases efficiency in the recognition process.
Secondly, the point of connection between the two tasks does not work well in either the forward or the backward pass.
The authors of CRAFTS [1] propose a totally new architecture (see figure 3.3) in which the gradient from the recognition component can flow back easily to the detection and backbone components, boosting detection performance simply by reducing the recognition loss.
Figure 3.1: Architecture 1 - Detection and Recognition are two separate models.
Figure 3.2: Architecture 2 - Detection and Recognition are two branches of the same
model.
Figure 3.3: Architecture 3 - Detection and Recognition are two components of the
same model in which the recognition model is built on top of the detection model.
3.1 Detection and text-spotting models
Current detection methods can be divided into three categories:
Regression-based methods: Faster RCNN [16], SSD [17], YOLO-based [18]
methods like TextBoxes++ [6], etc.
Segmentation-based methods: Pixellink [19], TextSnake [20], TextField [21],
etc.
Hybrid methods: one branch for detection using regression and another branch for segmentation, as in Mask R-CNN [22] based methods.
Some of the popular emerging models that make use of regression, segmentation, or a more advanced output format called heatmaps are:
3.1.1 CharNet (Convolution Character Network)
CharNet is a model of architecture type 2; it is a one-stage model that can process the two tasks simultaneously in one pass. CharNet directly outputs bounding boxes of words and characters, with corresponding character labels.
CharNet is an ANN whose output has two branches:
Character branch: Its purpose is to detect and recognize each character in the
image.
Detection branch: Its purpose is to detect text instances, which can be lines of characters (textlines), words, etc.
Due to the lack of datasets with enough annotations, the authors proposed an 'iterative character detection' approach, which is in fact a weakly-supervised learning method able to transfer the ability of character detection learned from synthetic data to real-world images.
Figure 3.4 depicts an overview of the architecture.
A Backbone networks
ResNet-50 [23] and Hourglass [24] networks are used as backbones. For ResNet-50, the architecture from EAST [9] is used. Convolutional feature maps with 4x down-sampling are used, which makes CharNet able to identify extremely small-scale text instances. Then, two Hourglass modules are stacked onto ResNet-50 and the final feature maps are up-sampled to 1/4 resolution of the input image.
Figure 3.4: Overview of the CharNet, which contains two branches working in parallel: a character branch for direct character detection and recognition, and a detection branch for text instance detection [14].
Figure 3.5: The architecture of Hourglass (source: https://www.programmersought.com/)
The Hourglass network is a type of convolutional encoder-decoder network (it uses convolutional layers to decompose and reconstruct the input). It accepts an input (in our case, an image) and extracts features from it by decomposing the image into a feature matrix. It then combines this feature matrix with earlier layers, which have a higher understanding of space than the feature matrix (a better sense of the position of the object in the image).
Note: The feature matrix has low spatial understanding, which means it can’t
really know where the object is in the image. This is because, in order to be able to
extract features of an object, we must discard all pixels that are not features of the
object. This means discarding all background pixels, and by doing so, it removes all
knowledge about the position of the object in the image. By combining the feature
matrix with the earlier layers of the network that have a higher understanding of
space, we can learn more about the input (what it is + its position in the image).
Figure 3.6 depicts different parts of the Hourglass network.
B Character branch
Input feature maps have a resolution of 1/4 that of the input image of CharNet. This branch consists of three sub-branches: text instance segmentation, character detection, and character recognition. The first two sub-branches have 3 convolutional layers with filter sizes 3x3, 3x3, 1x1. The third sub-branch has one more convolutional layer with filter size 3x3.
Figure 3.6: Different components of the Hourglass network (source: https://www.programmersought.com/)
For the text instance segmentation sub-branch, the labels are binary masks as in Mask R-CNN. The output of this sub-branch is a 2-channel feature map: per-pixel probabilities of being a character and of not being a character.
For the character detection sub-branch, its output will be 5-channel feature maps with the corresponding meanings:
the coordinate of the top edge of the character bounding box
the coordinate of the bottom edge of the character bounding box
the coordinate of the left edge of the character bounding box
the coordinate of the right edge of the character bounding box
the angle of rotation of the character bounding box
For the character recognition sub-branch, its output will be 68-channel feature maps (for 68 classes: 26 Latin characters, 10 digits, and 32 special characters).
Output feature maps from all sub-branches will have the same resolution as the
branch’s input feature maps (1/4 input image of CharNet).
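As a hedged sketch of how these three sub-branches could be laid out following the filter sizes and output channels above (the intermediate channel width and activations are assumptions, not taken from the paper):

import torch
import torch.nn as nn

def sub_branch(in_ch, out_ch, extra_3x3=False):
    layers = [nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU()]
    if extra_3x3:  # the recognition sub-branch has one more 3x3 conv
        layers += [nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU()]
    layers += [nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
               nn.Conv2d(in_ch, out_ch, 1)]  # 3x3, 3x3, 1x1 as described
    return nn.Sequential(*layers)

class CharacterBranch(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.seg = sub_branch(in_ch, 2)          # character / non-character per pixel
        self.det = sub_branch(in_ch, 5)          # 4 edge distances + rotation angle
        self.rec = sub_branch(in_ch, 68, True)   # 68 character classes

    def forward(self, f):                        # f: 1/4-resolution shared features
        return self.seg(f), self.det(f), self.rec(f)

out = CharacterBranch()(torch.randn(1, 256, 64, 64))
print([o.shape for o in out])  # all outputs keep the 1/4 input resolution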
C Detection branch
Depending on the target object, this branch will have different implementations:
If the target object is multi-oriented text: EAST detector model is used. This
model consists of two branches: text instance segmentation and text instance
detection. Its corresponding outputs are 2-channel feature maps for classifica-
tion (text and non-text probability) and 5-channel feature maps for bounding
boxes prediction (4 coordinates and 1 angle).
If the target object is curved text: the TextField model is used (basically, the authors add a third branch after the character branch and the detection branch to compute the direction field, helping to classify the text instances in the image).
D Iterative character detection
The iterative character detection is described as follows:
We first train an initial model on synthetic data, where both char-level and
instance-level annotations are available to the CharNet. Then we apply the
trained model to the training images from a real-world dataset, where char-
level bounding boxes are predicted by the learned model.
We apply the aforementioned rule to collect the "correct" char-level bounding boxes detected in real-world images (a char-level bounding box is considered "correct" if it lies in a text-instance bounding box whose number of predicted character bounding boxes equals the length of the transcription label). These boxes are used to further train the model with the corresponding transcripts provided. Note that we do not use the predicted character labels, which are not fully correct and would reduce the performance of the model.
This process is implemented iteratively to enhance the model capability gradu-
ally for character detection, which in turn continuously improves the quality of
the identified characters, with an increasing number of the “correct” char-level
bounding boxes generated, as shown in figure 3.7.
Figure 3.7: Character bounding boxes generated at 4 iterative steps from left to right. Red boxes indicate the "correct" ones identified by our rule, while blue boxes are invalid ones, which are not collected for training in the next step [14].
3.1.2 PMTD (Pyramid Mask Text Detector)
PMTD is based on the Mask R-CNN model.
Disadvantages of some previous Mask R-CNN based methods:
The use of the prototype of Mask R-CNN means classifying the character area
and the non-character area by pixels, not creating a shape-specific text mask.
This results in not taking advantage of information about the shape of the
text in the label (usually the label is a quadrilateral).
The label has not been correctly assigned. In particular, many background
pixels that are not in the text are treated as text pixels (see figure 3.8).
Errors from the Region Proposal Network (RPN) affect the process of finding text boxes (see figure 3.9).
Figure 3.8: Examples with imprecise segmentation labels. The area within green
box denotes the manually annotated text instance. Many background pixels not
belonging to the text instance are mislabeled as the foreground pixels, especially at
the border of the text box, which may hurt the performance of the Mask R-CNN
based methods [10].
Figure 3.9: The red box is the predicted bounding box and the green box refers to
the predicted text box. The existing Mask R-CNN based methods suffer from the
errors of bounding box detection while PMTD can regress more accurate text box
with the help of the informative soft text mask [10].
Therefore, the authors proposed some solutions:
Turn classification problem into regression problem: Instead of classifying each
pixel into one of two classes {0, 1}, PMTD assigns a value in the range [0, 1]
for each pixel called soft pyramid label (see figure 3.10).
This value is found based on the distance from the pixel to the edge of each
text instance. This helps to encode information about text shape and location
into the training data. Such pixel-labeling technique also reduces the effect of
mislabeled pixels near the edge of the text instance.
For the generation of text boxes, PMTD reinterprets the obtained 2D soft mask
into 3D space and introduces a novel plane clustering algorithm to derive the
optimal text box on the basis of 3D shape.
The figure 3.11 depicts the overview of the architecture of PMTD.
Specifically, we assign the center of text region as the apex of the pyramid with
an ideal value score = 1 and the boundary of text region as the bottom edge of the
pyramid. We use the linear interpolation to fill each triangle side of the pyramid, as
illustrated in figure 3.12. Doing the same for all text instances in the image with the raw labels provided, we obtain a new training label mask called the soft text mask, which is used directly for training PMTD.
Figure 3.10: Previous methods aim to find a {0, 1} label for each pixel while PMTD assigns a soft pyramid label with a value in [0, 1] [10].
Figure 3.11: Overall architecture of PMTD [10]
Figure 3.12: Generation of the soft pyramid label. For a pixel in the text area, its label is the height of the pyramid [10].
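As an illustration, the following sketch generates a soft pyramid label for the simplified case of an axis-aligned box (PMTD itself handles general quadrilaterals; this rectangle case is illustrative only):

import numpy as np

def soft_pyramid_label(h, w, box):
    """box = (x0, y0, x1, y1); the label is 1 at the center, 0 on the edges."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2      # apex of the pyramid
    hw, hh = (x1 - x0) / 2, (y1 - y0) / 2
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalized L-infinity distance to the center: 0 at the apex, 1 on the boundary.
    d = np.maximum(np.abs(xs - cx) / hw, np.abs(ys - cy) / hh)
    return np.clip(1.0 - d, 0.0, 1.0)          # linear interpolation down the sides

mask = soft_pyramid_label(64, 64, (10, 20, 50, 44))
print(mask.max(), mask[0, 0])  # 1.0 at the center, 0.0 outside the box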
In the inference phase, we will generate text bounding boxes by extracting in-
formation from soft text mask. Specifically, as a reverse process of generating a
pyramid label (soft text mask) from text region, we will first construct the pyramid
from the text mask, then take the bottom edge of the pyramid as the output text
box. Hence, the critical point is to parameterize and rebuild the pyramid.
Formally, the pyramid is composed of four supporting planes and one base plane. In the context of the pyramid mask, we can convert the predicted soft mask into a point set of (x, y, z), in which (x, y) denotes the location and z stands for the predicted score of this pixel. The base plane is formulated as the plane z = 0, and each supporting plane can be uniquely determined by the equation $Ax + By + Cz + D = 0$ with $C = 1$. Consequently, the task of the plane clustering algorithm is reduced to finding the optimal parameters A, B, D for each supporting plane.
In the initialization stage, the positive point set P is built by the condition z > 0.1. Then the apex of the initial pyramid is assigned as the center of P, with an ideal score z = 1. The four vertices of the pyramid base are initialized as the four corner points of the predicted text bounding box, shown in the left image of figure 3.13.
After initializing the pyramid, an iterative updating scheme is implemented for
clustering points, which is shown in figure 3.13. In the assignment step, we partition
each point to the nearest plane, and in the update step, we employ the robust
least square algorithm to regress four supporting planes from the clustered points
respectively, which is robust to the noise in the predicted text mask.
Figure 3.13: Plane clustering algorithm [10]
3.1.3 OBD (Orderless Box Discretization Network)
OBD [11] was invented to ensure consistent labeling, which is important for maintaining a stable training process, especially when training involves a large amount of data. It first discretizes the quadrilateral box into several key edges covering all potential horizontal and vertical positions. To decode accurate vertex positions, a simple yet effective matching procedure is proposed for reconstructing the quadrilateral bounding boxes.
The proposed scene text detection system consists of three core components: an Orderless Box Discretization (OBD) block, a Matching-Type Learning (MTL) block, and a Rescoring and Post-Processing (RPP) block. Figure 3.14 illustrates the overall pipeline of the proposed framework.
Figure 3.14: The architecture of OBD network [11]
A Orderless box discretization
The purpose of multi-orientation scene text detection is to accurately localize
the textual content by generating outputs in the form of rectangular or quadrilateral bounding boxes. Compared with rectangular annotations, quadrilateral labels
demonstrate an increased capability to cover effective text regions, especially for
rotated texts. However, simply replacing rectangular bounding boxes with quadri-
lateral annotations can introduce inconsistency because of the sensitivity of the
non-segmentation-based methods to label sequences. As shown in figure 3.15, the
detection model might fail to obtain accurate features for the corresponding points
when facing small disturbances. One possible reason behind this is that the neural-
network-based regressor for bounding box prediction is essentially a nonlinear con-
tinuous function, which means that each input is only mapped to one output. Thus
a non-function or a function with a steep gradient cannot be effectively fitted. In
this case, a small disturbance may completely change the whole sequence of the
vertex and thus a similar input may result in completely different output as well
as a steep gradient. Therefore, instead of predicting sequence-sensitive distances or
coordinates, an OBD block is proposed to discretize the quadrilateral box into eight
Key Edges (KEs) comprising order-irrelevant points, i.e., the minimum $x$ ($x_{min}$) and $y$ ($y_{min}$), the second-smallest $x$ ($x_2$) and $y$ ($y_2$), the second-largest $x$ ($x_3$) and $y$ ($y_3$), and the maximum $x$ ($x_{max}$) and $y$ ($y_{max}$) (see figure 3.15). x-KEs and y-KEs are used in the following sections to represent $[x_{min}, x_2, x_3, x_{max}]$ and $[y_{min}, y_2, y_3, y_{max}]$, respectively.
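Since the KEs are order-irrelevant, extracting them from an annotated quadrilateral amounts to sorting coordinates; a minimal sketch:

def key_edges(quad):
    """quad: list of 4 (x, y) vertices in any order."""
    xs = sorted(p[0] for p in quad)   # [x_min, x_2, x_3, x_max]
    ys = sorted(p[1] for p in quad)   # [y_min, y_2, y_3, y_max]
    return xs, ys

# Any permutation of the vertices yields the same KEs, removing the
# label-sequence sensitivity discussed above.
x_kes, y_kes = key_edges([(10, 5), (40, 8), (38, 30), (7, 26)])
print(x_kes, y_kes)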
(a) Previous regression-based methods (b) The OBD network
Figure 3.15: Comparison of (a) previous methods and (b) the OBD network. Previ-
ous methods directly regress the vertices, which can often be adversely affected by
inconsistent labeling of training data, resulting in unstable training and unsatisfac-
tory performances. OBD model tackles this problem and removes the ambiguity by
discretizing a quadrilateral bounding box that is orderless [11].
Specifically, the proposed approach is based on the widely used generic object
detection framework, Mask R-CNN [22]. As shown in figure 3.16, the proposals
processed by RoIAlign are fed into the OBD block with the pooling size of 14 × 14,
where the feature maps are forwarded through four convolutional layers with 256
output channels. The output features are then upsampled by a 2x deconvolutional layer and a 2x bilinear upscaling layer. Thus, the output size of the feature maps $F_{out}$ is $M \times M$, where $M$ is 56 in the implementation. Furthermore, two convolution kernels shaped as $1 \times M$ and $M \times 1$ with six channels are employed to shrink the horizontal and vertical features for the x-KEs and y-KEs, respectively. Finally, the OBD model is trained by minimizing the cross-entropy loss $L_{ke}$ over an $M$-way softmax output, where the corresponding positions of the ground-truth KEs are assigned to each output channel.
Figure 3.16: Illustration of the OBD and MTL blocks [11]
Similar to Mask R-CNN [22], the overall detector is trained in a multi-task manner. Thus, the loss function comprises four terms:
$$L = L_{cls} + L_{box} + L_{mask} + L_{ke} \qquad (3.1)$$
where the first three terms, $L_{cls}$, $L_{box}$ and $L_{mask}$, follow the same settings as presented in [22].
B Matching-type learning
It is noteworthy that the OBD block only learns to predict the numerical values of the eight KEs but is unable to predict the connection between the x-KEs and y-KEs. Therefore, the authors designed a matching procedure to reconstruct the quadrilateral bounding box from the KEs. Otherwise, an incorrect matching type may lead to completely unreasonable results (see figure 3.17).
(a) Correct Matching-type (b) Incorrect Matching-types
Figure 3.17: Illustration of different matching types [11]
Each x-KE should match one of the y-KEs to construct a corner point, such as $(x_{min}, y_{min})$, $(x_2, y_{max})$, and $(x_{max}, y_2)$. Then, all four constructed corner points are assembled for the final prediction, giving us the quadrilateral bounding box. It is important to note that different orders of the corners would produce different results. Hence, the total number of matching types between the x-KEs and y-KEs can be simply calculated as $A_4^4 = 4! = 24$.
Based on this, a simple yet effective MTL module is proposed to learn the connections between x-KEs and y-KEs. As shown in figure 3.16, the feature maps used for predicting the x-KEs and y-KEs are also used for classifying the matching types. Specifically, the output feature of the deconvolution layer is connected to a convolutional layer with an $\frac{M}{2} \times \frac{M}{2}$ kernel and 24 output channels. Thus, the matching procedure is formed as a 24-category classification task.
C Re-scoring and post-processing
The fact that the detectors can sometimes output high confidence scores for false
positive samples is a long-standing issue in the detection community for both generic
objects and text. One possible reason for this may be that the scoring head used in
most of the current literature is supervised by the softmax loss, which is designed
for classification but not for explicit localization. Moreover, the classification score
only considers whether the instance is foreground or background, and it shows less
sensitivity to the compactness of the bounding box.
Therefore, an RPP block is proposed to suppress unreasonable false positives. Specifically, RPP adopts a policy similar to multiple expert systems to reduce the risk of outputting high scores for negative samples. In RPP, an OBD score $S_{OBD}$ is first calculated based on the eight KEs (four x-KEs and four y-KEs); the refined confidence can then be obtained by:
$$score = \frac{(2 - \gamma)S_{box} + \gamma S_{OBD}}{2} \qquad (3.2)$$
where $0 \le \gamma \le 2$ is the weighting coefficient and $S_{box}$ is the original softmax confidence for the bounding box. Because both $S_{box}$ and $S_{OBD}$ are between $[0, 1]$, the value of $score$ is also between $[0, 1]$. Counting $S_{OBD}$ into the final score enables the proposed detector to draw lessons from multiple agents (eight KE scores) while enjoying the benefits of a tightness-aware confidence supervised by the KE prediction task.
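A minimal sketch of the re-scoring rule in equation 3.2 (the input values below are illustrative):

def refined_score(s_box, s_obd, gamma=1.0):
    """Equation 3.2: blend the box score with the KE-based OBD score."""
    assert 0.0 <= gamma <= 2.0          # gamma is the weighting coefficient
    return ((2.0 - gamma) * s_box + gamma * s_obd) / 2.0

# A confident but loose box is tempered by a low tightness-aware OBD score.
print(refined_score(0.95, 0.40))  # 0.675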
3.1.4 FOTS (Fast Oriented Text Spotting)
FOTS is an end-to-end trainable model that detects and recognizes all words in
a natural scene image simultaneously.
Basically, FOTS = EAST [9] detector + ROIRotate + RNN-based recognizer
with CTC decoder.
An overview of the framework is illustrated in figure 3.18. The text detection
branch and recognition branch share convolutional features. The backbone of the
shared network is ResNet-50 [23]. Inspired by FPN [25], FOTS backbone network
concatenates low-level feature maps and high-level semantic feature maps. The res-
olution of the feature maps produced by shared convolutions is 1/4 of the input image.
The text detection branch outputs dense per-pixel prediction of text using features
produced by shared convolutions. With oriented text region proposals produced by
detection branch, the proposed RoIRotate converts corresponding shared features
into fixed-height representations while keeping the original region aspect ratio. Fi-
nally, the text recognition branch recognizes words in region proposals. CNN and
LSTM are adopted to encode text sequence information, followed by a Connectionist
Temporal Classification (CTC) decoder.
Figure 3.18: Overall architecture. The network predicts both text regions and text
labels in a single forward pass [3].
A Text detection branch
Text detection branch is inspired by EAST [9]. As there are a lot of small text
boxes in natural scene images, feature maps are up-scaled from 1/32 to 1/4 size of
the original input image in shared convolutions. After extracting shared features,
one convolution is applied to output dense per-pixel predictions of words. The first
channel computes the probability of each pixel being a positive sample. Similar
to EAST [9], pixels in shrunk version of the original text regions are considered
positive. For each positive sample, the following 4 channels predict its distances to
top, bottom, left, right sides of the bounding box that contains this pixel, and the last
channel predicts the orientation of the related bounding box. Final detection results
are produced by applying thresholding and Non-Maximum Suppression (NMS) to
these positive samples.
The detection branch loss function is composed of two terms: a text classification term and a bounding box regression term. The text classification term can be seen as a pixel-wise classification loss for a down-sampled score map. Only the shrunk version of the original text region is considered the positive area, while the area between the bounding box and the shrunk version is considered "NOT CARE" and does not contribute to the classification loss.
B RoIRotate
RoIRotate applies transformation on oriented feature regions to obtain axis-
aligned feature maps, as shown in figure 3.19. In this work, output height is fixed
and the aspect ratio is kept unchanged to deal with the variation in text length.
RoIRotate pooling process can be divided into two steps. First, affine transforma-
tion parameters are computed via predicted or ground truth coordinates of text
proposals. Then, affine transformations are applied to shared feature maps for each
region respectively, and canonical horizontal feature maps of text regions are ob-
tained. The first step can be formulated as:
$$t_x = l \cdot \cos\theta - t \cdot \sin\theta - x \qquad (3.3)$$
$$t_y = t \cdot \cos\theta + l \cdot \sin\theta - y \qquad (3.4)$$
$$s = \frac{h_t}{t + b} \qquad (3.5)$$
$$w_t = s \cdot (l + r) \qquad (3.6)$$
$$M = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} s & 0 & 0 \\ 0 & s & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (3.7)$$
$$= s \begin{bmatrix} \cos\theta & -\sin\theta & t_x \cos\theta - t_y \sin\theta \\ \sin\theta & \cos\theta & t_x \sin\theta + t_y \cos\theta \\ 0 & 0 & \frac{1}{s} \end{bmatrix} \qquad (3.8)$$
where $M$ is the affine transformation matrix. $h_t$ and $w_t$ represent the height (8 in the authors' setting) and width of the feature maps after the affine transformation. $(x, y)$ represents the coordinates of a point in the shared feature maps, $(t, b, l, r)$ stands for the distances to the top, bottom, left and right sides of the text proposal respectively, and $\theta$ for the orientation. $(t, b, l, r)$ and $\theta$ can be given by the ground truth or the detection branch.
Figure 3.19: Illustration of RoIRotate. Here the authors used the input image to illustrate text locations, but it is actually operated on feature maps in the network. Best viewed in color [3].
With the transformation parameters, it is easy to produce the final RoI feature
using the affine transformation:
$$\begin{bmatrix} x_i^s \\ y_i^s \\ 1 \end{bmatrix} = M^{-1} \begin{bmatrix} x_i^t \\ y_i^t \\ 1 \end{bmatrix} \qquad (3.9)$$
and for $i \in [1...h_t]$, $j \in [1...w_t]$, $c \in [1...C]$,
$$V_{ij}^c = \sum_{n}^{h_s} \sum_{m}^{w_s} U_{nm}^c \, k(x_{ij}^s - m; \Phi_x) \, k(y_{ij}^s - n; \Phi_y) \qquad (3.10)$$
where $V_{ij}^c$ is the output value at location $(i, j)$ in channel $c$ and $U_{nm}^c$ is the input value at location $(n, m)$ in channel $c$. $h_s$ and $w_s$ represent the height and width of the input, and $\Phi_x$, $\Phi_y$ are the parameters of a generic sampling kernel $k()$ (bilinear interpolation here). As the width of text proposals may vary, in practice the feature maps are padded to the longest width, and the padded parts are ignored in the recognition loss function (for batching).
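As a sketch, the bilinear sampling of equations 3.9 and 3.10 corresponds to PyTorch's built-in affine sampling. Computing the matrix $M$ from $(t, b, l, r, \theta)$ via equations 3.3-3.8 is omitted here; the toy rotation matrix below is only illustrative:

import math
import torch
import torch.nn.functional as F

def roi_rotate(feat, theta_2x3, out_h=8, out_w=64):
    """feat: (1, C, H, W) shared features; theta_2x3: (1, 2, 3) affine params.
    affine_grid/grid_sample realize the bilinear kernel of equation 3.10."""
    grid = F.affine_grid(theta_2x3, size=(1, feat.size(1), out_h, out_w),
                         align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

feat = torch.randn(1, 32, 160, 160)
angle = math.radians(15.0)  # toy rotation; a real M comes from (t, b, l, r, theta)
theta = torch.tensor([[[math.cos(angle), -math.sin(angle), 0.0],
                       [math.sin(angle),  math.cos(angle), 0.0]]])
pooled = roi_rotate(feat, theta)
print(pooled.shape)  # torch.Size([1, 32, 8, 64]): fixed height, variable-width content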
C Text recognition branch
The text recognition branch aims to predict text labels using the region features extracted by shared convolutions and transformed by RoIRotate. Considering the length of the label sequences in text regions, the input features to the LSTM are reduced only twice along the width axis by the shared convolutions from the original image. Otherwise, discriminable features in compact text regions, especially those of narrow-shaped characters, would be eliminated.
The text recognition branch consists of VGG-like [26] sequential convolutions, poolings with reduction along the height axis only, one bi-directional LSTM [27], one fully-connected layer, and the final CTC decoder [28].
3.1.5 ContourNet
ContourNet mainly consists of three parts: Adaptive-RPN, LOTM and Point
Re-scoring Algorithm. In this section, we first briefly review the overall pipeline of the proposed method, and then detail the motivation and implementation of these three parts.
A Overall pipeline
The architecture of the ContourNet is illustrated in figure 3.20. First, a back-
bone network is constructed to generate shared feature maps. Inspired by Feature
Pyramid Network (FPN) [25] which can obtain strong semantic features for multi-
scale targets, a backbone with FPN-like architecture is constructed by implementing
lateral connections in the decoding layer. Next, an Adaptive-RPN is proposed for
proposal generation by bounding the spatial extent of several refined points. The inputs of LOTM are proposal features obtained by applying Deformable RoI pooling [29] and bilinear interpolation to the shared feature maps. Then, LOTM decodes the
contour points from proposal features by modeling the local texture information in
horizontal and vertical directions respectively. Finally, a Point Re-scoring Algorithm
is used to filter FPs by considering the responses in both directions simultaneously.
Bounding box regression and classification (text/non-text) in box branch are similar
to other 2-stage methods, which are used to further refine bounding boxes.
B Adaptive region proposal network
RPN is widely used in existing object detection methods. It aims to predict a 4-d regression vector $\{\Delta x, \Delta y, \Delta w, \Delta h\}$ to refine the current bounding box proposal $B_c = \{x_c, y_c, w_c, h_c\}$ into a predicted bounding box $B_t = \{x_c + w_c \Delta x,\; y_c + h_c \Delta y,\; w_c e^{\Delta w},\; h_c e^{\Delta h}\}$, and the training objective is to optimize the smooth $l_1$ loss [16].
The authors of ContourNet argue that using an $l_n$ norm loss (the smooth $l_1$ loss in particular) is not optimal for improving the Intersection over Union (IoU) between detected and labeled bounding boxes. Specifically, several pairs of bounding boxes at different scales with the same IoU value may have different $l_n$ norm distances, which makes it hard for CNN-based methods to learn samples with large scale variance in scene text detection.
Figure 3.20: The pipeline of ContourNet. It mainly contains three parts: Adaptive Region Proposal Network (Adaptive-RPN), Local Orthogonal Texture-aware Module (LOTM) and Point Re-scoring Algorithm. The box branch is similar to other 2-stage methods [30].
To handle this problem, a new Adaptive-RPN was proposed that focuses only on the IoU between the predicted and ground-truth bounding boxes, which is a scale-invariant metric, and uses a set of pre-defined points $P = \{(x_l, y_l)\}_{l=1}^{n}$ (1 center point and $n-1$ boundary points) instead of the 4-d vector for the proposal representation. The refinement can be expressed as:
$$R = \{(x_r, y_r)\}_{r=1}^{n} = \{(x_l + w_c \Delta x_l,\; y_l + h_c \Delta y_l)\}_{l=1}^{n} \qquad (3.11)$$
where $\{(\Delta x_l, \Delta y_l)\}_{l=1}^{n}$ are the predicted offsets to the pre-defined points, and $w_c$ and $h_c$ are the width and height of the current bounding box proposal. As shown in figure 3.21, the predicted offsets are used to perform a local refinement of the $n$ pre-defined points in the current bounding box proposal. Then, a max-min function in equation 3.12 is used to bound these refined points with 4 extreme points for the representation of the predicted bounding box. Specially, the center point $\{x_0, y_0\}$ is used to normalize the bounding box (e.g. if $x_{tl} > x_0$, then $x_{tl} = x_0$).
$$Proposal = \{x_{tl}, y_{tl}, x_{rb}, y_{rb}\} \qquad (3.12)$$
$$= \{\min\{x_r\}_{r=1}^{n},\; \min\{y_r\}_{r=1}^{n},\; \max\{x_r\}_{r=1}^{n},\; \max\{y_r\}_{r=1}^{n}\} \qquad (3.13)$$
Compared with conventional RPN that considers only rectangular spatial scope,
the proposed Adaptive-RPN automatically accounts for shape and semantically im-
portant local areas for finer localization of text regions. Without additional supervi-
sion, the regression loss in Adaptive-RPN is optimized only through an IoU loss by calculating the overlap between the predicted and ground-truth bounding boxes.
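A minimal sketch of the refinement and max-min bounding of equations 3.11-3.13 (the point layout and offsets below are illustrative):

import numpy as np

def adaptive_rpn_box(points, offsets, w_c, h_c):
    """points, offsets: (n, 2); by convention here, row 0 is the center point."""
    refined = points + offsets * np.array([w_c, h_c])   # equation 3.11
    center, boundary = refined[0], refined[1:]
    x_tl, y_tl = boundary[:, 0].min(), boundary[:, 1].min()  # equation 3.13
    x_rb, y_rb = boundary[:, 0].max(), boundary[:, 1].max()
    # Normalize with the center point (e.g. x_tl must not exceed x0).
    x0, y0 = center
    return min(x_tl, x0), min(y_tl, y0), max(x_rb, x0), max(y_rb, y0)

pts = np.array([[50., 50.], [30., 20.], [70., 20.], [70., 80.], [30., 80.]])
off = np.random.default_rng(0).normal(scale=0.05, size=pts.shape)
print(adaptive_rpn_box(pts, off, w_c=40.0, h_c=60.0))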
C Local orthogonal texture-aware module
Inspired by traditional edge detection operators (e.g. Sobel), which achieved remarkable performance before deep learning became the most prominent machine learning tool, the idea of traditional edge detection is incorporated into LOTM, and the text region is represented with a set of contour points. These points carry strong texture characteristics and can accurately localize texts of arbitrary shapes.
Figure 3.21: The comparison between conventional RPN (left) and Adaptive-RPN (right). The proposed Adaptive-RPN adaptively regresses the offsets to pre-defined points. The predicted bounding box is generated by bounding the spatial extent of the refined points. Red points are pre-defined points in the current bounding box proposal (e.g. the center point in conventional RPN and the pre-defined points P in Adaptive-RPN), and green points are refined points. The yellow dotted lines indicate the regressed offsets [30].
As shown in figure 3.22, LOTM contains two parallel branches. In the top
branch, we slide a convolutional kernel with size 1 × k over the feature maps to
model the local texture information in horizontal direction, which only focuses on
the texture characteristics in a k-range region. This local operation proves powerful in the experiments and, because of its small amount of computation, also preserves the efficiency of the method. In a like manner, the bottom branch
is constructed to model the texture characteristics in vertical direction through a
convolutional kernel with size k × 1. k is a hyper-parameter to control the size of
receptive field of texture characteristics. Finally, two sigmoid layers are implemented
to normalize the heatmaps to [0, 1] in both directions. In this way, text regions can
be detected in two orthogonal directions and represented with contour points in
two different heatmaps, either of which only responds to texture characteristics in
a certain direction.
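A hedged sketch of the two orthogonal branches, assuming single-channel heatmap outputs and an illustrative input width:

import torch
import torch.nn as nn

class LOTM(nn.Module):
    def __init__(self, in_ch=256, k=3):  # k controls the receptive field range
        super().__init__()
        self.h_branch = nn.Conv2d(in_ch, 1, (1, k), padding=(0, k // 2))  # 1 x k
        self.v_branch = nn.Conv2d(in_ch, 1, (k, 1), padding=(k // 2, 0))  # k x 1

    def forward(self, f):
        # Sigmoid normalizes each heatmap to [0, 1]; each branch responds
        # only to texture characteristics in its own direction.
        return torch.sigmoid(self.h_branch(f)), torch.sigmoid(self.v_branch(f))

h_map, v_map = LOTM()(torch.randn(1, 256, 32, 32))
print(h_map.shape, v_map.shape)  # two orthogonal contour-point heatmaps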
D Point re-scoring algorithm
As false-positive predictions can be effectively suppressed by considering the
response value in both orthogonal directions, two heatmaps from LOTM are further
processed through Point Re-scoring Algorithm: Points in different heatmaps are
first processed through NMS to achieve a tight representation. Then, to suppress
the predictions with strong unidirectional or weakly orthogonal response, we only
Figure 3.22: The visualization of LOTM (left). Point Re-scoring Algorithm (right)
is only used in testing stage [30].
select the points with distinct responses in both heatmaps as candidates. Finally, the text region can be represented with a polygon made up of these high-quality contour points.
3.1.6 CRAFT and CRAFTS
CRAFT [31] is a scene text detector and CRAFTS [1] is a scene text spotter.
Because CRAFTS is a newer and improved version of CRAFT, in this section we will only introduce an overview of the CRAFTS model.
The proposed CRAFTS network can be divided into three stages: the detection stage, the sharing stage, and the recognition stage. A detailed pipeline of the network is illustrated in figure 3.23. The detection stage takes an input image and localizes oriented text boxes. The sharing stage then pools backbone high-level features and detector outputs. The pooled features are rectified using the rectification module and concatenated to form a character-attended feature. In the recognition stage, an attention-based decoder predicts text labels using the character-attended feature. Finally, a simple post-processing technique is optionally used for better visualization.
A Detection stage
CRAFT detector [31] is selected as a base network because of its capability of rep-
resenting semantic information of the character regions. The outputs of the CRAFT
network represent center probability of character regions and linkage between them.
The authors said that this character centeredness information can be useful for the attention module in the recognition stage, since both components aim to localize the center positions of characters. In fact, CRAFTS makes three changes compared to the original CRAFT model: backbone replacement, link representation, and orientation estimation:
Backbone replacement: replace VGG-16D [26] with ResNet-50 [23]
Figure 3.23: Schematic overview of CRAFTS pipeline [1]
Link representation: The occurrence of vertical texts is not common in
Latin texts, but it is frequently found in East Asian languages like Chinese,
Japanese, and Korean. In this work, a binary center line is used to connect the
sequential character regions. This change was made (compared to CRAFT)
because employing the original affinity maps on vertical texts often produced
ill-posed perspective transformations that generated invalid box coordinates (see CRAFT for how it models the linkage between characters). To generate the ground-truth link map, a line segment with thickness $t$ is drawn between adjacent characters. Here, $t = \max(\frac{d_1 + d_2}{2} \times \alpha, 1)$, where $d_1$ and $d_2$ are the diagonal lengths of adjacent character boxes and $\alpha$ is the scaling coefficient. Using this equation makes the width of the center line proportional to the size of the characters.
Orientation estimation: It is important to obtain the right orientation of
text boxes since the recognition stage requires well-defined box coordinates to
recognize the text properly. Therefore, the authors added a 2-channel output map called the orientation map, which predicts the angle of each character along the x-axis and y-axis. To generate the ground truth of the orientation map, given the upward angle of the Ground Truth (GT) character bounding box represented as $\theta_{box}$, the channel predicting the x-axis has a value of $S_{cos}(p) = (\cos\theta + 1) \times 0.5$, and the channel predicting the y-axis has a value of $S_{sin}(p) = (\sin\theta + 1) \times 0.5$. The ground truth orientation map is generated by filling the pixels $p$ in the region of the text box with the values of $S_{cos}(p)$ and $S_{sin}(p)$. The trigonometric functions are not used directly so that the channels have the same output range as the region map and the link map: between 0 and 1. The loss function for the orientation map is calculated by equation 3.14.
$$L_{\theta} = \sum_{p} S_r(p) \cdot \left( \lVert S_{sin}(p) - S_{sin}^{*}(p) \rVert_2^2 + \lVert S_{cos}(p) - S_{cos}^{*}(p) \rVert_2^2 \right) \qquad (3.14)$$
where $S_{sin}^{*}(p)$ and $S_{cos}^{*}(p)$ denote the ground truth of the text orientation. Here, the character region score $S_r(p)$ is used as a weighting factor because it represents the confidence of the character centeredness. By doing this, the orientation loss is calculated only in the positive character regions.
The final objective function in the detection stage, $L_{det}$, is defined as
$$L_{det} = L_r + L_l + \lambda L_{\theta} \qquad (3.15)$$
where $L_r$ and $L_l$ denote the character region loss and the link loss, which are exactly the same as in CRAFT. $L_{\theta}$ is the orientation loss, multiplied by $\lambda$ to control its weight.
The architecture of the backbone and the modified detection head is illustrated in figure 3.24. The final output of the detector has four channels, representing the character region map $S_r$, the character link map $S_l$, and the two orientation maps $S_{sin}$ and $S_{cos}$.
Figure 3.24: The backbone of CRAFTS [1]
During inference, the authors apply the same post-processing as described in CRAFT to obtain text bounding boxes. First, using predefined threshold values, binary maps of the character region map $S_r$ and the character link map $S_l$ are made. Then, using the two maps, the text blobs are constructed by Connected Components Labeling (CCL). The final boxes are obtained by finding the minimum bounding box enclosing each text blob. The orientation of the bounding box is additionally determined using a pixel-wise averaging scheme. As shown in equation 3.16, the angle of the text box is found by taking the arctangent of the accumulated sine and cosine values of the predicted orientation map.
$$\theta_{box} = \arctan\left( \frac{\sum_{p} S_r(p) \times (S_{sin}(p) - 0.5)}{\sum_{p} S_r(p) \times (S_{cos}(p) - 0.5)} \right) \qquad (3.16)$$
$\theta_{box}$ denotes the orientation of the text box, and $S_{cos}$ and $S_{sin}$ are the 2-channel orientation outputs. The same character-centeredness-based weighting scheme used in the loss calculation is applied to predict the orientation as well.
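A minimal sketch of equation 3.16 (arctan2 is used instead of a bare arctangent to keep the correct quadrant; the maps below are synthetic):

import numpy as np

def box_orientation(s_r, s_sin, s_cos):
    """Region-score-weighted angle estimate from the orientation maps."""
    num = np.sum(s_r * (s_sin - 0.5))   # accumulated (shifted) sine values
    den = np.sum(s_r * (s_cos - 0.5))   # accumulated (shifted) cosine values
    return np.arctan2(num, den)

rng = np.random.default_rng(0)
s_r = rng.random((32, 32))                            # character region scores
theta = np.radians(20.0)
s_sin = np.full((32, 32), (np.sin(theta) + 1) * 0.5)  # ideal orientation maps
s_cos = np.full((32, 32), (np.cos(theta) + 1) * 0.5)
print(np.degrees(box_orientation(s_r, s_sin, s_cos)))  # ~20.0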
B Sharing stage
The sharing stage consists of two modules: a text rectification module and a character region attention (CRA) module. To rectify an arbitrarily-shaped text region, a Thin-Plate Spline (TPS) transformation is used. Inspired by the work of [32], CRAFTS' rectification module incorporates iterative TPS to acquire a better representation of the text region. By updating the control points iteratively, the curved geometry of a text in an image is progressively rectified. A typical TPS module takes a word image as input, but CRAFTS feeds it the character region map and link map, since they encapsulate the geometric information of the text regions.
The CRA module is the key component that tightly couples the detection and recognition modules. By simply concatenating the rectified character score map with the feature representation, the model gains the following advantages: creating a link between detector and recognizer allows the recognition loss to propagate through the detection stage, which improves the quality of the character score map; also, attaching the character region map to the features helps the recognizer attend better to the character regions.
C Recognition stage
The modules in the recognition stage are formed based on the results reported in [33]. There are three components in the recognition stage: feature extraction, sequence modeling, and prediction. The feature extraction module is made lighter than a standalone recognizer since it takes high-level semantic features as input. The detailed architecture of the module is shown in table 3.1. After extracting the features, a bidirectional LSTM is applied for sequence modeling, and an attention-based decoder makes the final text prediction.
Table 3.1: A simplified ResNet feature extraction module [1]
3.1.7 Comparison
Table 3.2 gives a general comparison between the different text detection and text spotting models.
Table 3.2: Comparison between different text detection and text spotting models
Here are some of our remarks about these models:
CharNet. Its accuracy on different benchmarks is pretty good, but:
It doesn't incorporate sequence modeling in the recognition module. Therefore, the model's ability to recognize textlines of characters is limited.
Its architecture is complex, with different modules, branches and even sub-branches. Hence, its processing time in the inference phase is significantly slow.
Its performance in the inference phase is negatively affected by the current post-processing algorithm and the way it groups different character boxes into the same text line.
It still relies on a binary mask, which is not as good as a soft mask (with values in the real range [0, 1]).
It lacks training datasets with enough annotation formats.
Its type of architecture is 2.
PMTD:
The plane clustering algorithm is computationally expensive. It takes a lot of time to find the four supporting planes of the pyramids in an iterative manner.
Modeling textlines of characters as pyramids with apex value 1 in a soft mask is not abstract and general enough (imagine two text instances modeled as pyramids with the same apex height, where one is a single character and the other is a very long text line).
OBD:
Its type of architecture is 1.
Its architecture is complex. It's based on Mask R-CNN, which is already a complex model, and the authors added three more complex blocks to avoid inconsistent labeling. Hence, the model is slow in the inference phase.
It doesn't make use of a soft mask. Instead, it only uses quadrilateral labels for its three additional blocks to model its output as Key Edges.
FOTS. In comparison with CRAFTS, FOTS has some weaknesses:
EAST is not as good as CRAFT in terms of accuracy.
Its type of architecture is 2.
It doesn’t rectify curved text lines before feeding pooled features to rec-
ognizer.
ContourNet:
It’s time consuming to create training labels.
This method focuses on extracting proposal features in the horizontal and vertical directions, which may not generalize to characters of arbitrary shape.
Its type of architecture is 1.
CRAFTS:
Its type of architecture is 3.
Its sharing module to connect detection and recognition module is com-
plex and has redundant components.
3.2 Recognition models
3.2.1 CRNN
The network architecture of CRNN, as shown in figure 3.25, consists of three components, from bottom to top:
the convolutional layers
the recurrent layers
a transcription layer
Figure 3.25: The architecture of CRNN [12]
At the bottom of CRNN, the convolutional layers automatically extract a feature
sequence from each input image. On top of the convolutional network, a recurrent
network is built for making prediction for each frame of the feature sequence, out-
putted by the convolutional layers. The transcription layer at the top of CRNN is
adopted to translate the per-frame predictions by the recurrent layers into a label
sequence. Though CRNN is composed of different kinds of network architectures
(e.g. CNN and RNN), it can be jointly trained with one loss function.
A Feature sequence extraction
In CRNN model, the component of convolutional layers is constructed by tak-
ing the convolutional and max-pooling layers from a standard CNN model (fully-
connected layers are removed). Such component is used to extract a sequential
feature representation from an input image. Before being fed into the network, all
the images need to be scaled to the same height. Then a sequence of feature vectors
is extracted from the feature maps produced by the component of convolutional
layers, which is the input for the recurrent layers. Specifically, each feature vector
of a feature sequence is generated from left to right on the feature maps by column.
This means the i-th feature vector is the concatenation of the i-th columns of all the
maps. The width of each column in this setting is fixed to a single pixel. As the layers
of convolution, max-pooling, and elementwise activation function operate on local
regions, they are translation invariant. Therefore, each column of the feature maps
corresponds to a rectangle region of the original image (termed the receptive field),
and such rectangle regions are in the same order to their corresponding columns on
the feature maps from left to right. As illustrated in figure 3.26, each vector in the
feature sequence is associated with a receptive field, and can be considered as the
image descriptor for that region.
Figure 3.26: The receptive field [12]
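A minimal sketch of this column-wise conversion, assuming a final feature map whose height has been pooled down to 1 (the shapes below are illustrative):

import torch

# (batch, channels, height=1, width): the usual shape after CRNN's conv stack.
feat = torch.randn(4, 512, 1, 25)
# Each column becomes one frame of the sequence, read left to right:
seq = feat.squeeze(2).permute(2, 0, 1)   # -> (T=25, batch, 512)
print(seq.shape)  # one 512-d descriptor per receptive field, ready for the RNN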
B Sequence labeling
A deep bidirectional RNN is built on the top of the convolutional layers, as
the recurrent layers. The recurrent layers predict a label distribution $y_t$ for each frame $x_t$ in the feature sequence $x = x_1, ..., x_T$. The advantages of the recurrent layers are three-fold. Firstly, RNN has a strong capability of capturing contextual
layers are three-fold. Firstly, RNN has a strong capability of capturing contextual
information within a sequence. Using contextual cues for image-based sequence
recognition is more stable and helpful than treating each symbol independently.
Taking scene text recognition as an example, wide characters may require several
successive frames to fully describe (refer to figure 3.26). Besides, some ambiguous
characters are easier to distinguish when observing their contexts, e.g. it is easier
to recognize “il” by contrasting the character heights than by recognizing each of
them separately. Secondly, an RNN can back-propagate error differentials to its input, i.e. the convolutional layers, allowing us to jointly train the recurrent layers and
the convolutional layers in a unified network. Thirdly, RNN is able to operate on
sequences of arbitrary lengths, traversing from starts to ends. To avoid the vanishing
gradient problem, LSTM unit will be used in this module. Specifically, sequence
labeling module will be a 2-layer bidirectional LSTM network.
C Transcription
Transcription is the process of converting the per-frame predictions made by
RNN into a label sequence. Mathematically, transcription is to find the label se-
quence with the highest probability conditioned on the per-frame predictions. In practice, there exist two modes of transcription, namely lexicon-free and lexicon-based transcription. A lexicon is a set of label sequences that the prediction is constrained to, e.g. a spell-checking dictionary. In lexicon-free mode, predictions are made without any lexicon. In lexicon-based mode, predictions are made by choosing the label sequence in the lexicon that has the highest probability.
In fact, the transcription layer is just an FCN which maps a contextual vector of size C to a transcription vector of size T (the number of timesteps).
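In the original CRNN paper, the per-frame predictions are decoded CTC-style; a hedged sketch of lexicon-free greedy decoding (collapse repeated labels, then drop blanks; the blank index below is an assumption):

import torch

def greedy_decode(logits, blank=0):
    """logits: (T, num_classes) per-frame predictions; returns label indices."""
    best = logits.argmax(dim=1).tolist()
    out, prev = [], blank
    for label in best:
        if label != prev and label != blank:  # collapse repeats, drop blanks
            out.append(label)
        prev = label
    return out

logits = torch.randn(25, 68)  # 25 frames, 68 classes (class 0 taken as blank)
print(greedy_decode(logits))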
3.2.2 RARE
Overall, RARE [13] takes an input image I and outputs a sequence $l = (l_1, ..., l_T)$, where $l_t$ is the t-th character and $T$ is the variable string length.
RARE is in essence an improved version of the CRNN model with two improvements:
Firstly, it integrates a Spatial Transformer Network (STN) to rectify curved text (similar in spirit to the sharing module of CRAFTS).
Secondly, instead of using a CTC decoder, RARE uses an attention-based decoder after the 2-layer bidirectional LSTM network. The authors call this whole module the Sequence Recognition Network (SRN).
Figure 3.27 depicts the structure of the SRN.
Figure 3.27: Structure of the SRN, which consists of an encoder and a decoder. The
encoder uses several convolution layers (ConvNet) and a two-layer BLSTM network
to extract a sequential representation (h) for the input image. The decoder generates
a character sequence (including the EOS token) conditioned on h [13].
Chapter 4
Proposed Solutions and
Improvements
4.1 Proposed solutions
In the last chapter, we learned about some of the most popular current methods for scene text detection and recognition. We have also made the most objective comparisons possible based on the information provided by those scientific papers. Therefore, in this chapter, we will build on the available information from chapters 1, 2 and 3, together with additional observations and arguments, to propose a reasonably effective solution to the current problem.
4.1.1 Remarks
After carefully studying some of today's most effective and popular methods for scene text detection and recognition and trying to review them as objectively as possible, we came to the following conclusions:
CRAFTS [1] is one of the latest text spotting models and it achieved state-of-
the-art performance on several benchmark datasets.
CRAFTS' authors applied a new type of architecture (architecture 3; see chapter 3 for additional information). As far as we know, this is the first model with an architecture that boosts the end-to-end training process more effectively than the other types of architecture.
CharNet is a really good model but it is much slower compared to CRAFTS.
It doesn’t utilize the sequence characteristic of text lines of characters.
FOTS is also a good model. However, it aims to solve oriented-text detection and recognition, which is in fact not our purpose (arbitrary-shape text spotting) in this thesis (see chapter 1). Moreover, EAST is less efficient than CRAFT and other CRAFT-based detectors. Additionally, RoIRotate is not as good as the Sharing module at pooling and rectifying curved text.
PMTD, OBD, and ContourNet are three detection models of the first archi-
tectural type. Hence, it’s not recommended to use them for our solution.
So much for the model, but what about the targeted objects? In fact, researchers often focus on developing models for Latin characters first, so their models are also optimized for those objects. However, there are a few obvious differences in
the problem of detecting and recognizing Han Nom characters compared to Latin
characters that we need to pay attention to when proposing an adaptive solution:
1. The number of character classes: the English alphabet has a total of 26 Latin character classes (from a to z), while the number of Chinese (or Han Nom) character classes is in the tens of thousands. Statistics show that a student of Chinese knows between 3,000 and 4,000 character classes. Therefore, recognizing Han Nom characters will change the model size, change the model structure, and lead to a data imbalance problem (the numbers of objects per class are unbalanced in the training datasets).
2. About the space between two adjacent characters in the character sequence:
Redundant space between characters will affect the performance of the text
line recognition model after pooling from shared features of previous layers in
the network. Figure 4.1 illustrates this difference.
Figure 4.1: The space between adjacent characters (image from https://rrc.cvc.uab.es/)
3. The range of Chinese character sequence lengths is greater than that of Latin character sequences. This makes it more difficult to model the sequence as well as to create effective input for the recognition model. Specifically, images containing Chinese character sequences will most likely not be able to keep their original aspect ratio, because their width must be resized to fit the input size of the model. Figure 4.2 depicts this observation.
Figure 4.2: Long sequence of characters is common in Chinese [34].
4. Latin characters are sometimes lined up vertically; however, this is much more common for Han Nom characters. More importantly, although such character sequences appear a lot in ideographic scripts in general (Chinese, Han Nom) and in character sequences at temples and shrines in particular (such as distichs at altars or temples), they have not been studied much. It is difficult to find a scientific paper that suggests a solution for, or even just refers to, vertical textline detection and recognition. See figure 4.3 for some examples.
Figure 4.3: Chinese vertical textline recognition has not been studied much yet (image from http://ecthaibinh.com/)
5. Finally, the direction of the characters: Han Nom characters usually appear in only one direction, standing upright (upward characters), while Latin characters can be oriented through 360 degrees. Figures 4.4 and 4.5 illustrate the orientations of some Latin and Chinese characters.
Figure 4.4: The orientation of Latin characters [1]
(a) Upward characters are common in real life (image from https://rrc.cvc.uab.es/).
(b) Oriented characters are rarely encountered (image from https://www.nationsonline.org/).
Figure 4.5: The orientation of Chinese characters
4.1.2 Adaptive solution
In order to tackle all the mentioned problems, our proposed solutions would be:
1. Design a text spotting model based on CRAFTS [1]. We call this new model UMATS, which stands for Unified Model for Arbitrary-shape Text Spotting. Its architecture is illustrated in figure 4.8.
2. As stated in [35],
There is no public report on effectively applying the attention mech-
anism to deal with the large-scale category text recognition tasks,
such as Chinese text recognition.
Moreover, they said that:
For long text sequences, the attention mechanism is difficult to train
from scratch owing to the misalignment between the input instance
image and the output text sequences, i.e., the attention drift phe-
nomenon.
Therefore, for a language with a large number of classes and long textlines, we replace the attention-based decoder with a CTC decoder in the recognition stage of CRAFTS (sec. 3.1.6).
3. Inspired by [5], we propose a new sharing module (sec. 3.1.6) called the Connector (or Connection module, Perspective Character RoI Pooling), which is essentially a meticulously designed perspective RoI pooling function. This module pools the feature region of each arbitrarily oriented character from the shared feature maps produced by the detection module, warping the region so that it represents an upward character. It then concatenates the pooled regions of all characters belonging to the same textline, following the order in which the characters appear in the textline. Hence, it solves three of the previously mentioned problems: the gaps between characters, the vertical textlines, and the arbitrary-shape textlines.
Figure 4.6 depicts the comparisons between the normal pooling used in other
methods and our proposed pooling in connector.
(a) The result of normal pooling for a vertical textline is still a vertical extracted-feature region, which is difficult for a recognizer to recognize.
(b) Normal pooling can neither remove the gaps between characters nor rectify curved text lines. This brings noise to the recognizer.
(c) The proposed pooling method puts the pooled feature regions of the characters into a horizontal row, which is a suitable input for the recognizer.
(d) The proposed pooling perfectly pools the useful feature regions into a horizontal row, which keeps the recognizer's feature inputs clean.
Figure 4.6: The comparisons between the normal pooling used in other methods
and our proposed pooling in connector. We can see that the proposed method is
more general and effective. Here we use the input image to illustrate text locations,
but it is actually operated on feature maps in the network.
4. To know the order in which the characters in a textline appear, we propose a new output map for the detection stage of CRAFTS (sec. 3.1.6) called the Order map. An example of the order map is shown in figure 4.7.
Figure 4.7: The example of an order map in colormap JET. The warmer the color,
the closer the character is to the top of the textline.
5. Because the proposed pooling’s operating mechanism already helps to deal
with arbitrary-shape text, we decide to remove the redundant rectification
module and Character Region Attention from CRAFTS.
4.2 Unified Model for Arbitrary-shape Text Spotting
4.2.1 Overview
The proposed UMATS model can be divided into three components: detection
component, connection component, and recognition component. An architectural
overview of the model is shown in figure 4.8. The detection component takes an input image and localizes arbitrary-shape textline and character boxes. The connection component then pools high-level backbone features, guided by the detector outputs. The pooled feature regions of the characters belonging to each textline are concatenated together to form a textline pooled feature. In the recognition component, a CTC decoder predicts the textline labels from the previously obtained features. Finally, a simple post-processing algorithm is used for visualizing the text spotting results. Note that the loss from the recognition component is back-propagated through the entire network architecture thanks to the newly-designed connection component.
For ease of description, from now on we will denote h, w, c as height, width and
number of channels of the input image of UMATS model, respectively.
4.2.2 Detector
The architecture of the detector is almost the same as in [1]. The only difference
is that now we will have four types of output maps instead of three maps:
1. Region map: A region map is a 3-dimensional matrix of shape (h/2, w/2, 1)
where each of its pixels has a value representing the probability that the pixel
is at the center of a character in the input image.
Figure 4.8: The overview of the proposed UMATS architecture. Blue arrows represent the forward pass, green paths represent loss-calculation paths, and red arrows represent the backward pass. Here we visualize the inputs of the connector and recognizer as images to make them easier to understand, but they are actually feature maps.
Figure 4.9: The detailed architecture of the proposed UMATS
2. Link map: A link map is a 3-dimensional matrix of shape (h/2, w/2, 1) where
each of its pixels has a value representing the probability that the pixel is at
the central area of a textline in the input image.
3. Orientation map: An orientation map is a 3-dimensional matrix of shape (h/2, w/2, 2) where some of its pixels have two values representing, respectively, the encoded cosine and sine of the orientation angle of a character in the input image. The other pixels have undefined values because they do not lie in any character region. The trigonometric values are not used directly; they are encoded so that the map values share the same range, between 0 and 1, with the region map, link map, and order map. Sometimes we consider this map as two separate maps: the cos map and the sin map.
4. Order map: An order map is our proposed map. It is a 3-dimensional matrix of shape (h/2, w/2, 1) where some of its pixels have a value representing how close the pixel's character is to the top of the textline in the input image. The other pixels have undefined values because they do not lie in any character region.
Figure 4.10 shows an example of region map, link map, and order map in some
colormaps.
Figure 4.10: Visualization of region, link, and order maps. The first column, the
second column and the last column respectively represent maps in grayscale col-
ormap, JET colormap, and a blended version with input image.
To visually represent the character orientation angles, we often use the Angle map. The angle map is a 3-dimensional matrix of shape (h/2, w/2, 1) where some of its pixels have a value representing the orientation angle of a character in the input image. The other pixels have undefined values because they do not lie in any character region. The value range of the orientation angle is converted from [0, 360] to [0, 1] based on the cos map and sin map. Figure 4.11 depicts the maps relating to the orientation angles of characters.
The architecture of the backbone and modified detection head of the UMATS
model is illustrated in figure 4.12. The final output of the detector has five channels, representing the character region map S_r, the character link map S_l, the cos map S_cos, the sin map S_sin, and the order map S_o.
Recent studies show that using ResNet50 captures well-defined feature representations for both the detector and the recognizer [36]. Therefore, we continue using ResNet50 with batch normalization as the backbone of the detection component. The model keeps using skip connections in the decoding part, as in UNet [37].
Its purpose is the same as other encoder-decoder architectures; it aggregates high-
spatial-understanding feature maps with high-semantic-understanding feature maps
in order to know what an object is and where it is in the input image.
Figure 4.11: Visualization of orientation-angle-related maps. Input image is also
illustrated for convenient consultation.
Regarding the issue of upscaling the spatial size of a feature map, there are several
methods such as transposed convolution and interpolation. However, as mentioned
in Deconvolution and Checkerboard Artifacts,
Now, neural nets typically use multiple layers of deconvolution when cre-
ating images, iteratively building a larger image out of a series of lower
resolution descriptions. While it’s possible for these stacked deconvolu-
tions to cancel out artifacts, they often compound, creating artifacts on
a variety of scales.
and
Another approach is to separate out upsampling to a higher resolution
from convolution to compute features. For example, you might resize
the image (using nearest-neighbor interpolation or bilinear interpolation)
and then do a convolutional layer. This seems like a natural approach,
and roughly similar methods have worked well in image super-resolution.
Therefore, we interleave bilinear interpolation layers (UpSample modules in figure
4.12) with UpConv Blocks to avoid checkerboard artifacts.
This encoder-decoder architecture is followed by a prediction head with four consecutive convolutional layers, as shown in figure 4.12. All convolutional layers are designed with appropriate padding values to preserve their output spatial sizes.
In the training phase, we compare each prediction map with its correspond-
ing groundtruth map by using Mean Square Error (MSE) loss. The procedures of
creating these groundtruth maps are as follows:
1. GT region map S_r^* (we mark ground-truth maps with an asterisk to distinguish them from the predicted maps). Figure 4.13 depicts the GT region map generation procedure:
- create a 2-dimensional isotropic Gaussian map;
- compute the perspective transformation matrix that maps the Gaussian map onto a GT character bounding box;
- warp the Gaussian map to the character box;
- repeat the process for all other GT character bounding boxes.
A small code sketch of this procedure follows.
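Below is a minimal sketch of this procedure using OpenCV; the (4, 2) corner-array box format, the canvas size g, and sigma are illustrative assumptions, not our exact implementation settings.

```python
import cv2
import numpy as np

def render_region_map(char_boxes, h, w, g=64, sigma=0.25):
    """Warp one isotropic 2D Gaussian into every GT character box.
    `char_boxes`: list of (4, 2) float corner arrays in map coordinates."""
    ax = np.arange(g) - (g - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    gaussian = np.exp(-(xx ** 2 + yy ** 2) /
                      (2 * (sigma * g) ** 2)).astype(np.float32)

    src = np.float32([[0, 0], [g - 1, 0], [g - 1, g - 1], [0, g - 1]])
    region_map = np.zeros((h, w), dtype=np.float32)
    for box in char_boxes:
        # Perspective matrix from the Gaussian canvas to the box, warp,
        # and keep the per-pixel maximum over all characters.
        m = cv2.getPerspectiveTransform(src, np.float32(box))
        region_map = np.maximum(region_map,
                                cv2.warpPerspective(gaussian, m, (w, h)))
    return region_map
```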
Figure 4.12: The architecture of UMATS detector
Figure 4.13: The generation process of the GT region map. Here the region map and the Gaussian map are shown in JET colormap, but they are actually in grayscale colormap during the data preparation process.

2. GT link map S_l^*. A link map is created by drawing a straight line between the centers of two adjacent character bounding boxes in a textline (a sketch of this step is given after figure 4.14). Its thickness is given by the following formula:

\[ t = \max\!\left(\frac{d_{11} + d_{12} + d_{21} + d_{22}}{4} \times \alpha,\; 1\right) \tag{4.1} \]

where d_{11}, d_{12} and d_{21}, d_{22} are the diagonal lengths of the two adjacent character boxes and \alpha is the scaling coefficient. The equation makes the thickness of the link line proportional to the size of the characters. We set \alpha to 0.1 in our implementation.
The illustration of this generation process is shown in figure 4.14.
Figure 4.14: The generation process of the GT link map. Here the link map is shown in JET colormap, but it is actually in grayscale colormap during the data preparation process.
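A minimal sketch of this link-drawing step for one pair of adjacent character boxes; the helper name and the (4, 2) corner-array box format are illustrative assumptions.

```python
import cv2
import numpy as np

def draw_link(link_map, box_a, box_b, alpha=0.1):
    """Draw one GT link following equation 4.1: a straight line between
    the two box centers, with thickness proportional to character size."""
    # Diagonal lengths d11, d12 of box_a and d21, d22 of box_b.
    d11 = np.linalg.norm(box_a[0] - box_a[2])
    d12 = np.linalg.norm(box_a[1] - box_a[3])
    d21 = np.linalg.norm(box_b[0] - box_b[2])
    d22 = np.linalg.norm(box_b[1] - box_b[3])
    t = max(int(round((d11 + d12 + d21 + d22) / 4.0 * alpha)), 1)

    ca = tuple(int(v) for v in box_a.mean(axis=0))
    cb = tuple(int(v) for v in box_b.mean(axis=0))
    cv2.line(link_map, ca, cb, color=1.0, thickness=t)
    return link_map
```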
3. GT orientation maps S_cos^* and S_sin^*. Denote the upward angle of a GT character bounding box as θ_box. Then, for each pixel p in the character region, its values in S_cos^* and S_sin^* are S_cos^*(p) = (cos(θ_box) + 1) × 0.5 and S_sin^*(p) = (sin(θ_box) + 1) × 0.5, respectively. The trigonometric values are not used directly so that the map values share the same output range, between 0 and 1, with the region map and the link map. See figure 4.15 for an illustration of the generation process.
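A small sketch of this angle encoding and its inverse; the function names are illustrative.

```python
import numpy as np

def encode_angle(theta_deg):
    """Map a character's upward angle to the [0, 1] cos/sin map values."""
    theta = np.deg2rad(theta_deg)
    return (np.cos(theta) + 1.0) * 0.5, (np.sin(theta) + 1.0) * 0.5

def decode_angle(s_cos, s_sin):
    """Recover the angle in degrees from the encoded map values."""
    c, s = s_cos * 2.0 - 1.0, s_sin * 2.0 - 1.0
    return np.rad2deg(np.arctan2(s, c)) % 360.0
```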
4. GT order map S_o^*. Each pixel in a GT character bounding box has a value corresponding to the character's distance from the top of the textline. Figure 4.16 illustrates an example of all these generated maps.
The loss function for the proposed order map is calculated by equation 4.2:

\[ L_o = S_r^*(p)\,\lVert S_o(p) - S_o^*(p) \rVert_2^2 \tag{4.2} \]
This loss function is in essence an MSE loss; however, we additionally use the GT region score S_r^*(p) as a weighting factor. The reason is that pixels outside the character regions have undefined values and consequently should not contribute to the loss. The same weighting factor is also used in the orientation loss.
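A minimal PyTorch sketch of this weighted MSE (equation 4.2); the mean reduction is an assumption on our part.

```python
import torch

def order_loss(pred_order, gt_order, gt_region):
    """Equation 4.2 as a region-weighted MSE: pixels outside character
    regions carry zero weight, so their undefined values are ignored."""
    per_pixel = (pred_order - gt_order) ** 2
    return (gt_region * per_pixel).mean()
```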
Figure 4.15: The generation process of GT orientation map. Angle map is usually
used for illustration instead.
Figure 4.16: Example of the GroundTruth maps of a training image
The final objective function of the detection component, L_det, is defined as

\[
\begin{aligned}
L_{det} &= L_r + L_l + \lambda_{\theta} L_{\theta} + \lambda_{o} L_{o} \\
L_r &= S_c(p)\,\lVert S_r(p) - S_r^*(p) \rVert_2^2 \\
L_l &= S_c(p)\,\lVert S_l(p) - S_l^*(p) \rVert_2^2 \\
L_{\theta} &= S_c(p)\,S_r^*(p)\left(\lVert S_{sin}(p) - S_{sin}^*(p) \rVert_2^2 + \lVert S_{cos}(p) - S_{cos}^*(p) \rVert_2^2\right) \\
L_{o} &= S_c(p)\,S_r^*(p)\,\lVert S_o(p) - S_o^*(p) \rVert_2^2
\end{aligned}
\]
where L_r, L_l, and L_θ are respectively the region, link, and orientation losses, which are exactly the same as in [1]. The parameters λ_θ and λ_o are added to the loss function to adjust the effects of L_θ and L_o on the detection loss, respectively. In our implementation, we set both values to 1.
Popular datasets usually contain training images with labeled ignored textlines: those that are unrecognizable by human eyes. Thus, S_c(p), which stands for the confidence map, is introduced to prevent those regions from affecting the detection loss. More generally, it is simply a map showing the confidence levels of the training labels (GT maps) in the range [0, 1], where 0 means a label cannot be trusted at all and 1 means it can be completely trusted. But under what circumstances can we not trust the training labels? Here are two typical cases:
- The first situation is when there are ignored textline regions in the input image. In this case, the GT map generation process assigns wrong values to these regions. For example, the region map is assigned 0s for ignored text regions even though those regions do contain characters. Due to the lack of character bounding boxes and/or transcriptions for ignored text regions, the generation process creates wrong GT maps that cannot be fully trusted.
- The second situation is when the GT training maps are created by a weakly-supervised learning process rather than from raw labels. Such training maps are called pseudo GT training maps, and they cannot be fully trusted either (see [31] for additional information).
In the inference phase, depending on the desired output, we use the corresponding algorithm:

1. Generating textline bounding boxes. First, a binary map M covering the image is initialized with 0. M(p) is set to 1 if S_r(p) > τ_r or S_l(p) > τ_l, where τ_r is the region threshold and τ_l is the link threshold. Second, connected component labeling (CCL) is performed on M. Lastly, each textline bounding box is obtained by finding the rotated rectangle with the minimum area enclosing the connected component corresponding to each label. Functions like connectedComponents and minAreaRect provided by OpenCV can be applied for this purpose (see the sketch below). Note that an advantage of UMATS is that it does not need any further post-processing, such as NMS: since CCL separates the image blobs of the word regions, the bounding box of a word is simply the single enclosing rectangle.
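A minimal OpenCV sketch of this inference procedure; the threshold values shown are placeholders rather than our tuned settings.

```python
import cv2
import numpy as np

def textline_boxes(region_map, link_map, tau_r=0.7, tau_l=0.4):
    """Threshold the maps, run connected-component labelling, and fit a
    minimum-area rotated rectangle around each component."""
    m = ((region_map > tau_r) | (link_map > tau_l)).astype(np.uint8)
    n_labels, labels = cv2.connectedComponents(m, connectivity=4)
    boxes = []
    for lbl in range(1, n_labels):                # label 0 is background
        ys, xs = np.where(labels == lbl)
        pts = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(pts)               # ((cx, cy), (w, h), angle)
        boxes.append(cv2.boxPoints(rect))         # 4 corner points
    return boxes
```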
2. Generating a polygon for each textline. The procedure of polygon generation is illustrated in figure 4.17. The first step is to find the local
maxima line of character regions along the scanning direction, as shown in the
figure with arrows in blue. The lengths of the local maxima lines are equally
set as the maximum length among them to prevent the final polygon result
from becoming uneven. The line connecting all the center points of the local
maxima is called the center line, shown in yellow. Then, the local maxima lines
are rotated to be perpendicular to the center line to reflect the tilt angle of
characters, as expressed by the red arrows. The endpoints of the local maxima
lines are the candidates for the control points of the text polygon. To fully
cover the text region, we move the two outer-most tilted local maxima lines
outward along the local maxima center line, making the final control points
(green dots).
Figure 4.17: Polygon generation for arbitrarily-shaped texts [31]
3. Generating character bounding boxes. First, we find the raw character bounding boxes by using the region map and its corresponding threshold value τ_r, in a procedure similar to the one used for textline bounding boxes. Second, we rotate the raw character bounding boxes using the orientation map. As shown in equation 4.3 (the same as in [1]), the angle of a character is found by taking the arctangent of the accumulated sine and cosine values of the predicted orientation map:

\[ \theta_{box} = \arctan\!\left(\frac{\sum_p S_r(p) \times (S_{sin}(p) - 0.5)}{\sum_p S_r(p) \times (S_{cos}(p) - 0.5)}\right) \tag{4.3} \]

where θ_box denotes the predicted orientation angle of the character box.
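A small NumPy sketch of equation 4.3; using arctan2 rather than a plain arctangent (to keep the quadrant) is an implementation choice of ours.

```python
import numpy as np

def character_angle(region, s_sin, s_cos):
    """Equation 4.3: arctangent of the region-weighted, accumulated
    decoded sine over the decoded cosine values."""
    num = np.sum(region * (s_sin - 0.5))
    den = np.sum(region * (s_cos - 0.5))
    return np.degrees(np.arctan2(num, den))   # arctan2 keeps the quadrant
```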
4. Generating organized character bounding boxes for the connector. As shown in figures 4.19, 4.12, and 4.8, the connection module takes two inputs:
- the shared feature map of shape (h/2, w/2, 130), which provides the features to pool;
- the organized character bounding boxes, which provide where to pool the features. The character bounding boxes obtained from the procedure above are further grouped into textlines by using the links in the link map. The character boxes in each textline are then sorted by the values provided by the order map. Figure 4.18 shows the detailed procedure of creating this input.
4.2.3 Connector
Our main contribution in the UMATS model is the connection component which
allows the detector and the recognizer to connect to each other more efficiently in
both forward and backward passes. This component operates in such a way that it
doesn’t require any rectification module to make a textline straight (see figure 4.19).
Therefore, it reduces the model complexity and also makes UMATS more general
to arbitrary-shape text.
In the inference phase, the connector takes two inputs: the shared feature maps and the organized character bounding boxes from the detector. It then straightens the arbitrary-shape textlines encoded in the shared feature maps into horizontal rows and outputs the pooled textline feature maps. Figure 4.20 shows an example of 130 pooled textline feature maps ready to be fed into the recognizer.

Figure 4.18: The organized character bounding boxes generation process

Figure 4.19: The architectural overview of the connector
In the training phase, the connector takes two inputs: the shared feature maps and the GT organized character bounding boxes. The process of finding the pooled textline feature maps is exactly the same as in the inference phase. The only difference is that, in the inference phase, the number of pooled textlines depends on how well the detector detects textlines, whereas in the training phase it depends only on the number of textlines labeled in the training images. In our implementation, we actually set a threshold on the maximum number of textlines to be pooled by the connector in each iteration in order to avoid out-of-memory errors; this value is therefore also the maximum batch size of the recognition component. Now, looking back at figure 4.19, we see that there is one more phase left: the special-case training phase. In the field of scene text detection and recognition, training datasets with character bounding boxes are very rare. On the other hand, a recognizer, especially a Chinese scene text recognizer, requires a lot of training data to avoid data imbalance between character classes. In fact, the existing text spotting datasets are not enough for the recognizer to converge during training, so we need a way to take advantage of existing recognition datasets. Thus, we design a new phase, called the special-case training phase, specifically for training datasets that have no organized character bounding boxes, only textline bounding boxes and transcriptions.

(a) The prediction organized character boxes
(b) 130 pooled maps in grayscale colormap
Figure 4.20: The pooled textline feature maps
In this special-case training phase, the connector takes two inputs which are the
shared feature maps and the GT textline bounding boxes. Then, the connector, like
other perspective RoI pooling layers, pools textline feature regions from the shared
feature maps into fixed-spatial-shape feature maps. In our implementation, we set
the spatial size to (h, w) = (16, 64) as in [1].
The connection component consists of three subcomponents: the perspective RoI pooling module, the concatenation module, and the batching module. Figure 4.21 depicts the schematic architecture of the connection component.
Figure 4.21: The detailed architecture of the connector. Here we use images for an
easier-to-understand illustration of feature maps.
The perspective RoI pooling module in turn has three sub-modules: a transformation matrix generator, a grid generator, and a sampler. The transformation matrix generator computes a perspective transformation matrix T from a character box to its standing character box of spatial size (16, 16). The grid generator then takes the desired destination grid G' of the output map and the inverse matrix T^{-1} to compute the input character grid G. Lastly, the sampler warps the input character feature map region to the standing character box by bilinear interpolation over the input character grid G. Figure 4.22 depicts how the perspective RoI pooling module works.
Figure 4.22: How the perspective RoI pooling works. Here we use images for an
easier-to-understand illustration of feature maps.
After getting pooled character feature maps, the feature maps of characters be-
longing to the same textline are concatenated by using concatenation module. For
example, if an input textline is a string of 5 characters, then the concatenated
textline feature map will be of shape (16, 16 × 5) = (16, 80). Finally, the textline
feature maps are zero-padded (or resized) along the width dimension so that the
batching module can group them into a minibatch.
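A small sketch of the concatenation and batching steps; `max_w` is a hypothetical cap, not a value from our configuration.

```python
import torch
import torch.nn.functional as F

def batch_textlines(textline_patches, max_w=1024):
    """Concatenate the (C, 16, 16) character patches of every textline
    along the width, then zero-pad all textlines to one common width so
    they can be stacked into a single minibatch."""
    lines = [torch.cat(chars, dim=2) for chars in textline_patches]
    width = min(max(line.shape[2] for line in lines), max_w)
    padded = [F.pad(line[:, :, :width], (0, width - min(line.shape[2], width)))
              for line in lines]
    return torch.stack(padded)                 # (num_textlines, C, 16, width)
```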
If we are in the special-case training phase, the concatenation module is no longer needed because we apply the perspective RoI pooling module directly to the textline bounding boxes.
In our implementation, we refer to OpenCV for how to compute the perspective transformation matrix in the image domain. Figure 4.23 summarizes the procedure (see also https://stackoverflow.com/).

Figure 4.23: How to calculate the perspective transformation matrix
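For reference, a sketch of the standard 4-point computation (the same approach cv2.getPerspectiveTransform takes): solve eight linear equations for the matrix entries, with the last entry fixed to 1.

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve for the 3x3 perspective matrix mapping four (x, y) points
    `src` onto four points `dst`, with the bottom-right entry fixed to 1
    (eight unknowns, eight equations)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    m = np.linalg.solve(np.array(A, dtype=np.float64),
                        np.array(b, dtype=np.float64))
    return np.append(m, 1.0).reshape(3, 3)
```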
4.2.4 Recognizer
The modules in the recognizer are designed based on the results reported in [1, 12, 33]. Figure 4.24 gives an overview of the recognizer. There are three modules in the recognition component: the feature extraction, sequence modeling, and prediction modules. The feature extraction module is made lighter than that of a standalone recognizer, since it takes the high-level semantic features returned by the connector as input. The detailed architecture of the module is shown in table 4.1. After extracting the visual features, a bidirectional LSTM is applied for sequence modeling, and a CTC decoder makes the final text prediction.
Figure 4.24: The overview of the recognizer. Here we use an image as the recognizer’s
input but actually the recognizer takes feature maps extracted by the previous de-
tection module in the model as input.
The architecture configurations of the sequence modeling and prediction modules are shown in table 4.2.
The objective function of the recognition component, L_rec, is the CTC negative log-likelihood:

\[ L_{rec} = -\sum_i \log p(Y_i \mid X_i) \tag{4.4} \]

where p(Y_i | X_i) indicates the generation probability of the character sequence Y_i from the cropped feature representation X_i of the i-th textline box.
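A minimal sketch of this objective using PyTorch's built-in CTC loss; the tensors below are random stand-ins for real network outputs and labels, and choosing index 0 for the blank is an assumption.

```python
import torch
import torch.nn as nn

# Shapes follow the thesis (65 timesteps, 6,801 classes incl. the blank);
# the tensors are random stand-ins for real network outputs and labels.
T, B, C = 65, 4, 6801
log_probs = torch.randn(T, B, C).log_softmax(dim=2)
targets = torch.randint(1, C, (B, 12), dtype=torch.long)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)   # blank index: our choice
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```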
The final loss L used for training combines the detection and recognition losses:

\[ L = L_{det} + L_{rec} \tag{4.5} \]
The overall flow of the recognition loss is shown in figure 4.8. The loss flows through the weights of the recognition stage and propagates towards the detection stage through the connection module. The detection loss, on the other hand, is used as an intermediate loss, and thus the weights before the detection stage are updated using both the detection and recognition losses.

Table 4.1: The configuration of the feature extraction module. The output of the feature extraction module is 65 visual 512-dimensional vectors.

Table 4.2: The configuration of the sequence modeling and prediction module, in which h means the hidden size of an LSTM cell and o means the output size of a layer. In our implementation, we train our model to recognize 6,801 distinct classes: 6,799 classes for Chinese, Latin, and number characters; 1 class for the ‘CTCBlank’ character; and the last class for all other characters.
Chapter 5
Implementation and Evaluation
In the previous chapter, we learned about our proposed model. We know the
reasons for choosing components for the model. We also know about their spe-
cific architectural configurations and operating mechanisms. So in this chapter, we
will report important information related to the implementation and the develop-
ment process of some experimental models. Then, we will validate the effectiveness of these experimental models by comparing them with today's best models on many different aspects.
5.1 Experimental models
Specifically, we have three experimental models for both detection and text-
spotting tasks:
1. The first model is CRAFT, but with the ResNet50 backbone as in [1]. We also replace the affinity map with the link map, which has proved more efficient. From now on, we refer to this experimental model as Modified CRAFT, or MCRAFT.
2. The second model is the detector of CRAFTS. We refer to this experimental
model as CRAFTS detector or CRAFTSDetector.
3. The last model is our proposed model named UMATS. Recall that UMATS
is already described in the last chapter as a tightly coupled detection and
recognition model.
So for the detection task, we conducted experiments with three models MCRAFT,
CRAFTSDetector, and UMATS detector. In essence, the CRAFTSDetector model
is just the MCRAFT model with an additional map in the output called the ‘ori-
entation map’. Likewise, the UMATS detector is nothing but CRAFTSDetector
model with one more output map called the ‘order map’. The main purpose of this
experiment is to assess the performance of these detectors. Another target is to
evaluate the effect of the new maps as they are added to these detectors’ output.
Then for the text-spotting task, we evaluate the UMATS model on many aspects
such as: how good the connection component is in sharing features between detector
and recognizer and in allowing end-to-end backpropagation from the recognition
module to the detection module, how good the whole model is for text spotting in
general and for different kinds of objects in particular, how effective the model is in
comparison with other state-of-the-art models, etc.
5.2 Datasets
Due to the lack of Nom-related datasets, we instead use several Chinese datasets, whose characters are quite similar to Nom characters. As the later results show, models trained only on these datasets still achieve fully acceptable results on the target objects.
5.2.1 ReCTS2019
ReCTS2019 (https://rrc.cvc.uab.es/) is a dataset from the Chinese character detection and recognition competition on signboards and billboards at the ICDAR2019 conference.
Here are some descriptions of the dataset:
The dataset consists of 25,000 photos of signboards using mostly Chinese char-
acters.
20,000 images are training images.
5,000 images are testing images.
There is no validation dataset.
It contains a quadrilateral bounding box for each textline of characters and its
corresponding transcription.
It contains a standing rectangular bounding box for each character and its
corresponding transcription.
Each textline or character has an additional label called "ignore" which indi-
cates whether or not the annotators can recognize text in that region.
If a textline or character is unrecognizable, then its transcription will be
‘###’.
It contains mainly Chinese characters, and sometimes even Latin characters,
number characters, and other special characters that often appear on sign-
boards.
The shape of the character sequence in the dataset is varied: horizontal
textlines, vertical textlines, multi-oriented textlines, etc.
This dataset can be used for:
character recognition
textline recognition
textline detection
end-to-end text spotting
Please note that the raw training dataset is not clean. Hence, we had to perform some preprocessing and filter out images unusable for our tasks. In the end, we use only 19,959 images for the training process.
Figure 5.1: Example of some ReCTS2019 dataset images
5.2.2 SynthText
SynthText (https://www.robots.ox.ac.uk/) is a synthetically generated dataset in which word instances are placed in natural scene images while taking the scene layout into account. Here are some descriptions of the dataset:
It contains 858,750 synthetic scene image files (.jpg) split into 200 directories.
It consists of 7,266,866 word-instances.
It consists of 28,971,487 characters.
It contains mainly Latin characters, as well as number characters and a few
special characters.
Each text instance is annotated with its text-string, word-level and character-
level bounding-boxes.
Bounding boxes are of quadrilateral shape.
It does not have a testing dataset and validation dataset.
Words are embedded synthetically on natural scene images. Therefore, words
can be in any orientation.
Almost all competitions evaluate a model's ability on Latin characters, no matter which dataset they use. Hence, although this dataset does not contain Chinese characters, it is still used in the training process so that the comparisons between the experimental models and other models are fairer.
5.2.3 Chinese Synthetic String dataset
This is a dataset commonly used in Chinese Optical Character Recognition
(OCR):
Figure 5.2: Example of some SynthText dataset images
A total of 3.64 million pictures, divided into a training set and a validation set at a 99:1 ratio.
Currently, 3,279,606 images are provided as training images with annotations.
364,400 images are provided as testing images with annotations.
Using Chinese corpus (news + classical Chinese), the data is randomly gen-
erated through changes in font, size, grayscale, blur, perspective, stretching,
etc.
It contains 5,990 character classes such as Chinese characters, English letters,
numbers, and punctuation.
Each sample is fixed at 10 characters; the characters are randomly excerpted from sentences in the corpus.
Image resolution is 280 × 32.
Figure 5.3: Example of some Chinese Synthetic String dataset images
5.2.4 Chinese Street View Text dataset
A total of 290,000 pictures are included: 210,000 labeled images are used as the training dataset and 80,000 unlabeled images as the testing dataset. The dataset is collected from Chinese Street View and is formed by cutting out the text line areas (such as shop signs, landmarks, etc.) in the street view pictures.
All the images are preprocessed. Specifically, by using affine transformation, the
text area is proportionally mapped to a picture with a height of 48 pixels, as shown
in figure 5.4.
Figure 5.4: Example of some Chinese Street View Text dataset images
5.3 Implementation
All source code in this research was written entirely by the author. The model implementation therefore proceeded sequentially through the steps of selecting the environment, selecting resources, and building the model's components.
5.3.1 Development environment
Below are the components in our development environment:
Programming language: Python 3.8.5
Framework (Libraries): Pytorch 1.7.1, TorchVision 0.8.2
Platform: Ubuntu 20.04.2 LTS 64-bit
IDE: Vim editor 8.1, Tmux 3.0a, Gnome Terminal 3.36.2, etc.
Online services: Google Colab, GitHub, etc.
Hardware:
CPU: Intel Core i5-9400F (2.9GHz turbo up to 4.1GHz, 6 cores, 6 threads,
9MB Cache, 65W)
GPU: VGA ASUS ROG Strix GeForce RTX 2070 SUPER OC edition
8GB GDDR6
Memory: 2 × Ram ADATA XPG SPECTRIX D50 RGB 8GB (1x8GB)
DDR4 3200MHz
etc.
5.3.2 Training strategy
First, we train MCRAFT, CRAFTSDetector, and the UMATS detector on the SynthText dataset for 2 epochs. For this dataset, Gaussian blurring and cropping (and possibly rotation, depending on which objects we want to focus on) are the data augmentation methods used. Then, we train these models on SynthText and ReCTS2019 mixed at a 1:5 ratio for 100 epochs. For ReCTS2019, we apply 5 (or 6) data augmentation methods: Gaussian blurring, noise adding, color jittering, cropping, padding (and rotation). The Adam optimizer is used, and Online Hard Example Mining (OHEM) [38] is applied to enforce a 1:3 ratio of positive to negative pixels in the detection loss (a sketch of this pixel selection follows below). To minimize the effect of data imbalance between character classes when training the recognizer, we first freeze the trained UMATS detector and train the UMATS recognizer on a combination of the Chinese Synthetic String dataset and the Chinese Street View Text dataset for 5 epochs. These two datasets are used in the special-case training phase of the connection module because they lack character bounding boxes; being unable to create GT maps for training, due to this lack, is also one of the main reasons we freeze the detector. The recognition loss is then optimized with the Adadelta optimizer. Note that the connection module does not contain any trainable parameters; it is just a stateless function or layer. For these recognition datasets, Gaussian blurring, noise adding, and color jittering are used as the data augmentation methods. Lastly, we jointly train the detector and the recognizer in UMATS for 15 epochs using the SynthText and ReCTS2019 datasets at the same ratio as before. To avoid out-of-memory errors, we apply techniques such as Automatic Mixed Precision (AMP), gradient checkpointing, time-memory trade-offs, etc.
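A minimal sketch of the OHEM pixel selection mentioned above; the exact reduction is an assumption on our part.

```python
import torch

def ohem_detection_loss(per_pixel_loss, pos_mask, neg_ratio=3):
    """Keep every positive pixel plus only the hardest negatives, at a
    1:3 positive-to-negative ratio, and average the selected losses."""
    pos_loss = per_pixel_loss[pos_mask]
    neg_loss = per_pixel_loss[~pos_mask]
    k = min(neg_ratio * max(int(pos_mask.sum()), 1), neg_loss.numel())
    hard_neg, _ = neg_loss.topk(k)               # largest-loss negatives
    return (pos_loss.sum() + hard_neg.sum()) / max(pos_loss.numel() + k, 1)
```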
Below is a summary of the important implementation configurations:

DET_BATCH_SIZE = 4
REC_BATCH_SIZE = 128 if we freeze the UMATS detector
REC_BATCH_SIZE = 32 if we jointly train the UMATS detector and recognizer
TRAIN_IMAGE_SIZE = (768, 768)
DET_LEARNING_RATE = 0.5e-4
DET_WEIGHT_DECAY = 1e-2
REC_LEARNING_RATE = 1.0
REC_RHO = 0.95
REC_EPS = 1e-8
REC_WEIGHT_DECAY = 1e-5
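A sketch of how these values would be wired into the two optimizers; the stand-in modules below are placeholders for the actual detector and recognizer networks.

```python
import torch
import torch.nn as nn

# Stand-in modules; in practice these are the UMATS detector and recognizer.
detector = nn.Conv2d(3, 5, kernel_size=3, padding=1)
recognizer = nn.LSTM(input_size=512, hidden_size=256,
                     num_layers=2, bidirectional=True)

det_opt = torch.optim.Adam(detector.parameters(),
                           lr=0.5e-4, weight_decay=1e-2)
rec_opt = torch.optim.Adadelta(recognizer.parameters(),
                               lr=1.0, rho=0.95, eps=1e-8,
                               weight_decay=1e-5)
```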
5.4 Experimental results
5.4.1 Results of the detection models
As mentioned before, this experiment is conducted to evaluate and compare the performance of the experimental detectors on benchmark datasets, both against other models and among the experimental models themselves, in order to see the effects and contributions of the proposed improvements.
Targeting ‘Task 3 - Text Line Detection’ of the ICDAR 2019 Robust Reading Challenge on Reading Chinese Text on Signboard (https://rrc.cvc.uab.es/), we evaluate the performance of the experimental detection models on the testing set of the ReCTS2019 dataset. Like the other teams, we uploaded our detection results as .txt files to the competition's evaluation server to get our final scores. Following the evaluation protocol of the ICDAR 2017-RCTW [34] dataset, this task is evaluated in terms of Precision, Recall, and Hmean (F-score) at IoU thresholds of 0.5 and 0.7. The Hmean at IoU = 0.5 is used as the only metric for the final ranking. Detected or missed ‘ignored’ ground truths do not contribute to the evaluation result. Table 5.1 shows the results of several models on this challenge.
We can see that the MCRAFT model outperforms the models provided by CRAFT's authors by a large margin. Even CRAFTSDetector and the UMATS detector, which have to optimize their parameters for one and two additional output maps, still outperform the CRAFT model provided in the authors' repository by a large margin. Another interesting point is that although adding the orientation map reduces the effectiveness of the model (from MCRAFT to CRAFTSDetector), adding the order map increases the model's detection efficiency (from CRAFTSDetector to UMATS detector). Note that the order map was not added to the output with the intent of increasing the detector's performance; it was created to provide the connection module with important information about characters, and we merely hoped that it might boost detection performance after training. Based on the results, we conclude that information about the appearance order of characters in the sequence does help to enhance detection performance in the UMATS detector.
Table 5.1: The results of several detection models on ReCTS2019 Task 3 (evaluation at https://rrc.cvc.uab.es/?ch=12&com=evaluation&task=3)

Model | Rank | Recall | Precision | Hmean
MCRAFT (sec. 5.1) | 16 | 87.12% | 87.75% | 87.44%
CRAFTSDetector (sec. 5.1) | 21 | 81.12% | 91.31% | 85.91%
UMATS detector (sec. 4.2.2) | 20 | 85.44% | 87.15% | 86.29%
UMATS detector* (sec. 4.2.2; *: after E2E training UMATS) | 19 | 85.14% | 88.30% | 86.69%
CRAFT [31] (pre-trained general model from the authors' repo, https://github.com/clovaai/CRAFT-pytorch) | 31 | 75.61% | 77.01% | 76.31%
CRAFT [31] (model uploaded by authors) | 17 | 85.33% | 89.38% | 87.31%
EAST [9] | 22 | 82.27% | 88.49% | 85.27%
TextBoxes++ [6] | 23 | 87.02% | 81.23% | 84.03%
Moreover, after we jointly train the UMATS detector and recognizer, we get a significant improvement in the textline detection performance of the UMATS detector. This demonstrates the effectiveness of the proposed connection module in sharing information back and forth between the two main components of the UMATS model.
Now, we compare the detection speed of MCRAFT, CRAFTSDetector, the UMATS detector*, and other models. As shown in table 5.2, our experimental models outperform the current state-of-the-art models by a large margin in terms of inference speed. Moreover, our models even meet the common real-time standard (~25 FPS) for most applications. Notably, our experimental models are about three times faster than CRAFT [31], the model we based ours on, while maintaining competitively high accuracy. Our models are even faster than FOTS [3], a ‘Fast Oriented Text Spotting with a Unified Network’ model, which underlines their efficiency. Another point to note is that the CRAFTSDetector model is slightly slower than MCRAFT because CRAFTSDetector has to compute one more output map, the orientation map. The same logic applies to the UMATS detection model.
We show some visual comparisons between MCRAFT, CRAFTSDetector, UMATS
detector* and pre-trained CRAFT model in figures 5.5, 5.6, 5.7, and 5.8.
Table 5.2: Detection speed of different popular models. Data are referenced from [31].

Method | FPS
RRD* [39] | 10
PixelLink* [19] | 3.0
Mask TextSpotter [15] | 4.8
EAST* [9] | 13.2
TextSnake [20] | 1.1
R2CNN [40] | 0.4
Wordsup [41] | 1.9
SegLink [7] | 20.6
SSTD [42] | 7.7
OBD [11] | 0.83
PMTD [10] | 4.5
TextField [21] | 6.0
CRAFT [31] | 8.6
TextBoxes++* [6] | 2.3
FOTS [3] | 23.9
MCRAFT | 24.14
CRAFTSDetector | 24.08
UMATS detector | 24.05
(a) MCRAFT (b) CRAFTSDetector
(c) UMATS detector* (d) CRAFT
Figure 5.5: Visual detection results for occluded, partly-captured, and blurred char-
acters. Experimental models surpass the pre-trained CRAFT model in these cir-
cumstances. Please note that we use some data augmentation methods such as
noise adding, blurring, and cropping during our training phase which improves the
capability of the experimental models on these cases.
(a) The generation process of the groundtruth region map and affinity map in
CRAFT [31]
(b) MCRAFT (c) CRAFTSDetector
(d) UMATS detector* (e) CRAFT
Figure 5.6: Visual detection results for textlines with large gaps between contiguous characters. In the affinity map of CRAFT, the warped isotropic Gaussian distributions modeling the affinities between two adjacent characters are badly distorted when there are large spaces between those characters. In contrast, the link map used in the experimental models models the relationship of two adjacent characters belonging to the same text line by a straight line connecting the two characters' centers. Its modeling capability is therefore stable across a wide variety of practical contexts.
(a) MCRAFT (b) CRAFTSDetector
(c) UMATS detector* (d) CRAFT
Figure 5.7: Visual detection results for vertical textlines. The way CRAFT generates the groundtruth affinity map makes the affinity between two adjacent characters in a vertical line disappear. Such badly generated GT maps degrade the quality of CRAFT's training on vertical textlines.
(a) MCRAFT (b) CRAFTSDetector
(c) UMATS detector* (d) CRAFT
Figure 5.8: Visual results of textline detection on complex backgrounds. All models still occasionally mistake image regions with special textures for textlines of characters and box them.
5.4.2 Results of the UMATS text-spotting model
For ‘Task 4 - End-to-End Text Spotting’ of the ReCTS2019 competition (https://rrc.cvc.uab.es/), the results of several models are shown in table 5.3. We evaluate the performance of the experimental models on the testing set of the ReCTS2019 dataset. Like the other teams, we uploaded our results as .txt files to the competition's evaluation server to get our final scores. First, each detection is matched to the ground-truth polygon with the maximum IoU, or to ‘None’ if no IoU is larger than 0.5. If multiple detections match the same ground truth, only the one with the maximum IoU is kept and the others are recorded as ‘None’. Then, we calculate the edit distances between all matching pairs (s_i, ŝ_i). The predicted transcriptions are evaluated with the Normalized Edit Distance (NED), formulated as:
\[ 1 - \mathrm{NED} = 1 - \frac{1}{N}\sum_{i=1}^{N} \frac{D(s_i, \hat{s}_i)}{\max(|s_i|, |\hat{s}_i|)} \tag{5.1} \]

where D stands for the Levenshtein distance, s_i denotes the predicted text line, ŝ_i denotes the corresponding ground truth, and N is the total number of text lines.
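A minimal sketch of this metric; feeding unmatched predictions or ground truths in as empty strings is an assumption on our part.

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def one_minus_ned(preds, gts):
    """1-NED score over matched (prediction, ground truth) string pairs."""
    total = sum(levenshtein(p, g) / max(len(p), len(g), 1)
                for p, g in zip(preds, gts))
    return 1.0 - total / max(len(gts), 1)
```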
Table 5.3: The results of several models on ReCTS2019 Task 4 - End-to-End Text Spotting (evaluation at https://rrc.cvc.uab.es/?ch=12&com=evaluation&task=4)

Model | Recall | Precision | Hmean | 1-NED
UMATS (sec. 4.2) | 83.12% | 81.32% | 82.21% | 44.01%
UMATS* (sec. 4.2; *: after E2E training UMATS) | 82.91% | 85.90% | 84.38% | 54.61%
submit2 | 69.49% | 89.52% | 78.24% | 50.36%
CRAFTS [1] | 75.89% | 78.44% | 77.14% | 41.68%
We can easily see that our proposed model outperforms its base version by a significant margin. This clearly demonstrates the positive effect of the connection component on the overall efficiency of the model. In particular, there was a significant improvement in the recognition accuracy of the UMATS model after end-to-end training (from 44.01% for UMATS to 54.61% for UMATS*). The two main reasons for these results are:
- During end-to-end training, we can optimize the detection network's trainable parameters for our character recognition task by back-propagating gradients from the recognition loss through the whole model.
- We train the UMATS* model on the ReCTS2019 training set, which contains character classes that were untrained or underrepresented when the detection and recognition components were trained separately.
For the speed comparison, we compare UMATS with other text spotting models; the results are shown in table 5.4. Note that the other models target Latin characters and numbers, not Chinese.
Table 5.4: Text spotting speed of different popular models. ‘*’ denotes results based on multiscale tests.

Method | FPS
Deep TextSpotter [43] | 9
TextNet* [5] | 2.7
MaskTextSpotter* [15] | 2.0
Qin et al. [44] | 4.7
FOTS* [3] | 7.5
Li et al. [45] | 1.3
UMATS (for Latin) | 8.55*
UMATS (for Chinese) | 8.28*
CRAFTS [1] | 5.4
We can see that our experimental models achieve performance competitive with other emerging models. In particular, our models are faster than CRAFTS thanks to the newly-designed connection module, which efficiently connects the detection component to the recognition component without needing the rectification module, an additional complex neural network. An important point to note is that the speed of a text-spotting model is affected by its detection performance: the more text lines the detector locates, the longer it takes the recognizer to decode them.
Some text spotting visual results on ReCTS2019 testing dataset are shown in
figures 5.9 and 5.10.
Thanks to the connection module, the UMATS model can easily deal with
arbitrary-shape text. Figure 5.11 depicts some of the results on arbitrary-shape
text.
Our proposed model is therefore a good model for Han Nom textline detection
and recognition on natural scene images. As shown in figures 5.12 and 5.13, distichs,
horizontal lacquered boards, and other textlines in historical documents can be easily
spotted by our model. In particular, our model easily handles vertical textline recognition by pooling each character region and arranging the pooled regions in a horizontal line.
Figure 5.9: Qualitative results on ReCTS2019 dataset
Figure 5.10: Qualitative results on ReCTS2019 dataset
Figure 5.11: Visual results on arbitrary-shape text
Figure 5.12: Horizontal lacquered boards and historical sites
Figure 5.13: Distichs
Chapter 6
Conclusions and Future work
6.1 Conclusions
In this thesis, I present an end-to-end trainable single-pipeline model that tightly couples the detection and recognition modules. The effective perspective character RoI pooling in the connection module not only helps rectify arbitrary-shape textlines but also lets the recognition loss back-propagate from the recognizer through the whole network easily. We no longer need a separate rectification module and therefore reduce the model's complexity while maintaining high performance. Additionally, the model is designed with modularization in mind, and the source code was written entirely by the author, so the model will be easy to develop further in the future. Moreover, to the best of our knowledge, this is the first model aimed at solving the Han Nom text detection and recognition problem. We therefore hope it will be a good reference for other researchers in the future.
6.2 Future work
In the future, we plan to refine the model to improve both its accuracy and
speed:
replace the backbone network with current state-of-the-art networks such as EfficientNet, FixEfficientNet-L2, etc.
replace the CTC decoder with some newly-designed decoders
replace the Gaussian distribution for each character region with other distributions
train the model additionally on other datasets
support other languages such as Korean, Japanese, etc.
take advantage of weakly-supervised training [1] and train the model with much more data.
List of Figures
1.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1 Artificial neuron structure . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Popular activation functions . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Feedforward neural network structure . . . . . . . . . . . . . . . . . . 6
2.4 The shape of several neural network volumes . . . . . . . . . . . . . . 7
2.5 Convolutional neural network architecture . . . . . . . . . . . . . . . 7
2.6 Convolution operation . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 Convolution layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.8 Depth column . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.9 How zero-padding affects the spatial size of the output volume . . . . 10
2.10 Max pooling layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.11 Recurrent neural network architecture . . . . . . . . . . . . . . . . . 12
2.12 Neuron structure of a recurrent neural network . . . . . . . . . . . . . 12
2.13 RoI pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.14 Example of a feature map . . . . . . . . . . . . . . . . . . . . . . . . 17
2.15 Example of a region proposal . . . . . . . . . . . . . . . . . . . . . . 17
2.16 2x2 pooling sections . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.17 Pooled feature map . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.18 The quantization of RoI pooling . . . . . . . . . . . . . . . . . . . . . 18
2.19 RoI Align . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.20 How RoI Align calculates output for each smaller region . . . . . . . . 19
2.21 Object detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.22 Shallow and deep learning . . . . . . . . . . . . . . . . . . . . . . . . 21
2.23 Segmentation types . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.24 Faster R-CNN and Mask R-CNN . . . . . . . . . . . . . . . . . . . . 22
2.25 Image interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.26 Linear interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.27 Quadratic interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.28 Interpolation example using resizing . . . . . . . . . . . . . . . . . . . 24
2.29 Interpolation example using rotation . . . . . . . . . . . . . . . . . . 24
2.30 Bilinear Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Text spotting architecture 1 . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Text spotting architecture 2 . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 Text spotting architecture 3 . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 The architecture of CharNet . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 The architecture of Hourglass . . . . . . . . . . . . . . . . . . . . . . 32
3.6 Different components of the Hourglass network . . . . . . . . . . . . . 33
3.7 Iterative Character Detection . . . . . . . . . . . . . . . . . . . . . . 34
3.8 First limitation of Mask R-CNN based methods . . . . . . . . . . . . 35
3.9 Second limitation of Mask R-CNN based methods . . . . . . . . . . . 35
3.10 Soft pyramid label . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.11 Overall architecture of PMTD . . . . . . . . . . . . . . . . . . . . . . 36
3.12 Generation of soft pyramid label . . . . . . . . . . . . . . . . . . . . . 36
3.13 Plane algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.14 The architecture of OBD network . . . . . . . . . . . . . . . . . . . . 37
3.15 Comparison between OBD and other previous methods . . . . . . . . 38
3.16 Illustration of the OBD and MTL blocks . . . . . . . . . . . . . . . . 39
3.17 Illustration of different matching types . . . . . . . . . . . . . . . . . 39
3.18 The architecture of FOTS . . . . . . . . . . . . . . . . . . . . . . . . 41
3.19 Illustration of RoIRotate . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.20 The pipeline of ContourNet . . . . . . . . . . . . . . . . . . . . . . . 44
3.21 Adaptive RPN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.22 The visualization of LOTM . . . . . . . . . . . . . . . . . . . . . . . 46
3.23 Schematic overview of CRAFTS pipeline . . . . . . . . . . . . . . . . 47
3.24 The backbone of CRAFTS . . . . . . . . . . . . . . . . . . . . . . . . 48
3.25 The architecture of CRNN . . . . . . . . . . . . . . . . . . . . . . . . 51
3.26 The receptive field . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.27 Structure of the SRN . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 The space between adjacent characters . . . . . . . . . . . . . . . . . 55
4.2 Long Chinese sequence . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Chinese vertical textline . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 The orientation of Latin characters . . . . . . . . . . . . . . . . . . . 57
4.5 The orientation of Chinese characters . . . . . . . . . . . . . . . . . . 57
4.6 Pooling method comparison . . . . . . . . . . . . . . . . . . . . . . . 58
4.7 Order map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.8 The overview of the proposed UMATS architecture . . . . . . . . . . 60
4.9 The detailed architecture of the proposed UMATS . . . . . . . . . . . 60
4.10 Visualization of region, link, and order maps . . . . . . . . . . . . . . 61
4.11 Visualization of orientation-angle-related maps . . . . . . . . . . . . . 62
4.12 The architecture of UMATS detector . . . . . . . . . . . . . . . . . . 63
4.13 The generation process of GT region map . . . . . . . . . . . . . . . 64
4.14 The generation process of GT link map . . . . . . . . . . . . . . . . . 64
4.15 The generation process of GT orientation map . . . . . . . . . . . . . 65
4.16 The GroundTruth maps . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.17 Polygon generation for arbitrarily-shaped texts . . . . . . . . . . . . . 67
4.18 The organized character bounding boxes generation process . . . . . . 68
4.19 The architectural overview of the connector . . . . . . . . . . . . . . 68
4.20 The pooled textline feature maps . . . . . . . . . . . . . . . . . . . . 69
4.21 The detailed architecture of the connector . . . . . . . . . . . . . . . 70
4.22 How the perspective RoI pooling works . . . . . . . . . . . . . . . . . 70
4.23 How to calculate perspective transformation matrix . . . . . . . . . . 71
4.24 The overview of the recognizer . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Example of some ReCTS2019 dataset images . . . . . . . . . . . . . . 75
5.2 Example of some SynthText dataset images . . . . . . . . . . . . . . 76
5.3 Example of some Chinese Synthetic String dataset images . . . . . . 77
5.4 Example of some Chinese Street View Text dataset images . . . . . . 77
5.5 Visual results for occluded, partly-captured, and blurred characters . 81
5.6 Visual results for textlines with large gaps between characters . . . . 82
5.7 Visual results for vertical textlines . . . . . . . . . . . . . . . . . . . . 83
5.8 Visual results of textlines detection on complex backgrounds . . . . . 83
5.9 Qualitative results on ReCTS2019 dataset . . . . . . . . . . . . . . . 86
5.10 Qualitative results on ReCTS2019 dataset . . . . . . . . . . . . . . . 87
5.11 Visual results on arbitrarily-shaped text . . . . . . . . . . . . . . . . 88
5.12 Horizontal lacquered boards and historical sites . . . . . . . . . . . . 89
5.13 Distichs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
List of Tables
2.1 Different types of gates of an RNN . . . . . . . . . . . . . . . . . . . 13
2.2 GRU and LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Variants of RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Simplified ResNet-50 . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Comparison between different models . . . . . . . . . . . . . . . . . . 49
4.1 The configuration of the feature extraction module . . . . . . . . . . 72
4.2 The configuration of the sequence modeling and prediction module . . 72
5.1 The results of several detection models on ReCTS2019 Task 3 . . . . 80
5.2 Detection speed of different popular models . . . . . . . . . . . . . . 81
5.3 The results of several models on ReCTS2019 Task 4 . . . . . . . . . . 84
5.4 Text spotting speed of different popular models . . . . . . . . . . . . 85
Bibliography
[1] Y. Baek, S. Shin, J. Baek, S. Park, J. Lee, D. Nam, and H. Lee, “Character
region attention for text spotting,” 2020.
[2] H. Li, P. Wang, and C. Shen, “Towards end-to-end text spotting with
convolutional recurrent neural networks,” CoRR, vol. abs/1707.03985, 2017.
[Online]. Available: http://arxiv.org/abs/1707.03985
[3] X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan, “FOTS: fast
oriented text spotting with a unified network,” CoRR, vol. abs/1801.01671,
2018. [Online]. Available: http://arxiv.org/abs/1801.01671
[4] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial
transformer networks,” CoRR, vol. abs/1506.02025, 2015. [Online]. Available:
http://arxiv.org/abs/1506.02025
[5] Y. Sun, C. Zhang, Z. Huang, J. Liu, J. Han, and E. Ding, “Textnet: Irregular
text reading from images with an end-to-end trainable network,” CoRR, vol.
abs/1812.09900, 2018. [Online]. Available: http://arxiv.org/abs/1812.09900
[6] M. Liao, B. Shi, and X. Bai, “Textboxes++: A single-shot oriented
scene text detector,” CoRR, vol. abs/1801.02765, 2018. [Online]. Available:
http://arxiv.org/abs/1801.02765
[7] B. Shi, X. Bai, and S. J. Belongie, “Detecting oriented text in natural images
by linking segments,” CoRR, vol. abs/1703.06520, 2017. [Online]. Available:
http://arxiv.org/abs/1703.06520
[8] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, “Detecting text in natural
image with connectionist text proposal network,” CoRR, vol. abs/1609.03605,
2016. [Online]. Available: http://arxiv.org/abs/1609.03605
[9] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, “EAST: an
efficient and accurate scene text detector,” CoRR, vol. abs/1704.03155, 2017.
[Online]. Available: http://arxiv.org/abs/1704.03155
[10] J. Liu, X. Liu, J. Sheng, D. Liang, X. Li, and Q. Liu, “Pyramid
mask text detector,” CoRR, vol. abs/1903.11800, 2019. [Online]. Available:
http://arxiv.org/abs/1903.11800
[11] Y. Liu, T. He, H. Chen, X. Wang, C. Luo, S. Zhang, C. Shen, and
L. Jin, “Exploring the capacity of sequential-free box discretization network
for omnidirectional scene text detection,” CoRR, vol. abs/1912.09629, 2019.
[Online]. Available: http://arxiv.org/abs/1912.09629
[12] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network
for image-based sequence recognition and its application to scene text
recognition,” CoRR, vol. abs/1507.05717, 2015. [Online]. Available:
http://arxiv.org/abs/1507.05717
[13] B. Shi, X. Wang, P. Lv, C. Yao, and X. Bai, “Robust scene text recognition
with automatic rectification,” CoRR, vol. abs/1603.03915, 2016. [Online].
Available: http://arxiv.org/abs/1603.03915
[14] L. Xing, Z. Tian, W. Huang, and M. R. Scott, “Convolutional
character networks,” CoRR, vol. abs/1910.07954, 2019. [Online]. Available:
http://arxiv.org/abs/1910.07954
[15] M. Liao, P. Lyu, M. He, C. Yao, W. Wu, and X. Bai, “Mask
textspotter: An end-to-end trainable neural network for spotting text with
arbitrary shapes,” CoRR, vol. abs/1908.08207, 2019. [Online]. Available:
http://arxiv.org/abs/1908.08207
[16] S. Ren, K. He, R. B. Girshick, and J. Sun, “Faster R-CNN: towards real-time
object detection with region proposal networks,” CoRR, vol. abs/1506.01497,
2015. [Online]. Available: http://arxiv.org/abs/1506.01497
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C.
Berg, “SSD: single shot multibox detector,” CoRR, vol. abs/1512.02325, 2015.
[Online]. Available: http://arxiv.org/abs/1512.02325
[18] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, “You only look
once: Unified, real-time object detection,” CoRR, vol. abs/1506.02640, 2015.
[Online]. Available: http://arxiv.org/abs/1506.02640
[19] D. Deng, H. Liu, X. Li, and D. Cai, “Pixellink: Detecting scene text via
instance segmentation,” CoRR, vol. abs/1801.01315, 2018. [Online]. Available:
http://arxiv.org/abs/1801.01315
[20] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, “Textsnake: A
flexible representation for detecting text of arbitrary shapes,” CoRR, vol.
abs/1807.01544, 2018. [Online]. Available: http://arxiv.org/abs/1807.01544
[21] Y. Xu, Y. Wang, W. Zhou, Y. Wang, Z. Yang, and X. Bai, “Textfield:
Learning a deep direction field for irregular scene text detection,” CoRR, vol.
abs/1812.01393, 2018. [Online]. Available: http://arxiv.org/abs/1812.01393
[22] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,” CoRR, vol.
abs/1703.06870, 2017. [Online]. Available: http://arxiv.org/abs/1703.06870
[23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” CoRR, vol. abs/1512.03385, 2015. [Online]. Available:
http://arxiv.org/abs/1512.03385
[24] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,”
CoRR, vol. abs/1808.01244, 2018. [Online]. Available:
http://arxiv.org/abs/1808.01244
[25] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie,
“Feature pyramid networks for object detection,” CoRR, vol. abs/1612.03144,
2016. [Online]. Available: http://arxiv.org/abs/1612.03144
[26] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-
scale image recognition,” in International Conference on Learning Representa-
tions, 2015.
[27] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,”
IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, 1997. [Online].
Available: http://dblp.uni-trier.de/db/journals/tsp/tsp45.html#SchusterP97
[28] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist
temporal classification: Labelling unsegmented sequence data with recurrent
neural networks,” in ICML ’06: Proceedings of the International Conference
on Machine Learning, 2006.
[29] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable convnets v2: More
deformable, better results,” CoRR, vol. abs/1811.11168, 2018. [Online].
Available: http://arxiv.org/abs/1811.11168
[30] Y. Wang, H. Xie, Z. Zha, M. Xing, Z. Fu, and Y. Zhang, “Contournet: Taking
a further step toward accurate arbitrary-shaped scene text detection,” 2020.
[31] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, “Character region awareness
for text detection,” CoRR, vol. abs/1904.01941, 2019. [Online]. Available:
http://arxiv.org/abs/1904.01941
[32] F. Zhan and S. Lu, “ESIR: end-to-end scene text recognition via iterative
image rectification,” CoRR, vol. abs/1812.05824, 2018. [Online]. Available:
http://arxiv.org/abs/1812.05824
[33] J. Baek, G. Kim, J. Lee, S. Park, D. Han, S. Yun, S. J. Oh, and H. Lee,
“What is wrong with scene text recognition model comparisons? dataset
and model analysis,” CoRR, vol. abs/1904.01906, 2019. [Online]. Available:
http://arxiv.org/abs/1904.01906
[34] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. J. Belongie,
S. Lu, and X. Bai, “ICDAR2017 competition on reading Chinese text in
the wild (RCTW-17),” CoRR, vol. abs/1708.09585, 2017. [Online]. Available:
http://arxiv.org/abs/1708.09585
[35] X. Chen, L. Jin, Y. Zhu, C. Luo, and T. Wang, “Text recognition in the wild:
A survey,” 2020.
[36] S. Long, X. He, and C. Yao, “Scene text detection and recognition: The
deep learning era,” CoRR, vol. abs/1811.04256, 2018. [Online]. Available:
http://arxiv.org/abs/1811.04256
[37] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for
biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015. [Online].
Available: http://arxiv.org/abs/1505.04597
[38] A. Shrivastava, A. Gupta, and R. B. Girshick, “Training region-based object
detectors with online hard example mining,” CoRR, vol. abs/1604.03540, 2016.
[Online]. Available: http://arxiv.org/abs/1604.03540
[39] M. Liao, Z. Zhu, B. Shi, G.-S. Xia, and X. Bai, “Rotation-sensitive regression
for oriented scene text detection,” 2018.
[40] Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, and
Z. Luo, “R2CNN: rotational region CNN for orientation robust scene
text detection,” CoRR, vol. abs/1706.09579, 2017. [Online]. Available:
http://arxiv.org/abs/1706.09579
[41] H. Hu, C. Zhang, Y. Luo, Y. Wang, J. Han, and E. Ding, “Wordsup:
Exploiting word annotations for character based text detection,” CoRR, vol.
abs/1708.06720, 2017. [Online]. Available: http://arxiv.org/abs/1708.06720
[42] P. He, W. Huang, T. He, Q. Zhu, Y. Qiao, and X. Li, “Single shot text
detector with regional attention,” CoRR, vol. abs/1709.00138, 2017. [Online].
Available: http://arxiv.org/abs/1709.00138
[43] M. Bušta, L. Neumann, and J. Matas, “Deep textspotter: An end-to-end train-
able scene text localization and recognition framework,” in 2017 IEEE Inter-
national Conference on Computer Vision (ICCV), 2017, pp. 2223–2231.
[44] S. Qin, A. Bissacco, M. Raptis, Y. Fujii, and Y. Xiao, “Towards unconstrained
end-to-end text spotting,” CoRR, vol. abs/1908.09231, 2019. [Online].
Available: http://arxiv.org/abs/1908.09231
[45] H. Li, P. Wang, and C. Shen, “Towards end-to-end text spotting in
natural scenes,” CoRR, vol. abs/1906.06013, 2019. [Online]. Available:
http://arxiv.org/abs/1906.06013